2024-01-31T21:45:37.1549726Z Requested labels: linux.g5.4xlarge.nvidia.gpu 2024-01-31T21:45:37.1550008Z Job defined at: pytorch/pytorch/.github/workflows/_linux-test.yml@refs/tags/ciflow/slow/118586 2024-01-31T21:45:37.1550308Z Reusable workflow chain: 2024-01-31T21:45:37.1550450Z pytorch/pytorch/.github/workflows/slow.yml@refs/tags/ciflow/slow/118586 (86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528) 2024-01-31T21:45:37.1550581Z -> pytorch/pytorch/.github/workflows/_linux-test.yml@refs/tags/ciflow/slow/118586 (86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528) 2024-01-31T21:45:37.1550744Z Waiting for a runner to pick up this job... 2024-01-31T22:06:19.8763194Z Job is about to start running on the runner: i-0f7dd096d568eb180 (organization) 2024-01-31T22:06:24.5267447Z Current runner version: '2.312.0' 2024-01-31T22:06:24.5273632Z Runner name: 'i-0f7dd096d568eb180' 2024-01-31T22:06:24.5274461Z Runner group name: 'Default' 2024-01-31T22:06:24.5275412Z Machine name: 'ip-10-0-30-139' 2024-01-31T22:06:24.5280146Z ##[group]GITHUB_TOKEN Permissions 2024-01-31T22:06:24.5282157Z Actions: read 2024-01-31T22:06:24.5282722Z Checks: read 2024-01-31T22:06:24.5283239Z Contents: read 2024-01-31T22:06:24.5283741Z Deployments: read 2024-01-31T22:06:24.5284301Z Discussions: read 2024-01-31T22:06:24.5285040Z Issues: read 2024-01-31T22:06:24.5285594Z Metadata: read 2024-01-31T22:06:24.5286132Z Packages: read 2024-01-31T22:06:24.5286642Z Pages: read 2024-01-31T22:06:24.5287126Z PullRequests: read 2024-01-31T22:06:24.5287738Z RepositoryProjects: read 2024-01-31T22:06:24.5288389Z SecurityEvents: read 2024-01-31T22:06:24.5288924Z Statuses: read 2024-01-31T22:06:24.5289458Z ##[endgroup] 2024-01-31T22:06:24.5292519Z Secret source: Actions 2024-01-31T22:06:24.5293348Z Prepare workflow directory 2024-01-31T22:06:24.7889444Z Prepare all required actions 2024-01-31T22:06:24.8068365Z Getting action download info 2024-01-31T22:06:24.9696504Z Download action repository 'pytorch/test-infra@main' (SHA:3009cb9119fdcc3fb4dd1613c9d8d3fc04c53004) 2024-01-31T22:06:25.2774783Z Download action repository 'pytorch/pytorch@main' (SHA:e426924c19500011aa419accc251a923fe857b94) 2024-01-31T22:06:28.0077776Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2024-01-31T22:06:28.2572535Z Getting action download info 2024-01-31T22:06:28.3573419Z Download action repository 'malfet/checkout@silent-checkout' (SHA:e07af140b3ccefc05679e3755b9db68f4ee4589c) 2024-01-31T22:06:28.5570900Z Getting action download info 2024-01-31T22:06:28.6728671Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482) 2024-01-31T22:06:28.8213996Z Uses: pytorch/pytorch/.github/workflows/_linux-test.yml@refs/tags/ciflow/slow/118586 (86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528) 2024-01-31T22:06:28.8216087Z ##[group] Inputs 2024-01-31T22:06:28.8216692Z build-environment: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-01-31T22:06:28.8218779Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}]} 2024-01-31T22:06:28.8221292Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:06:28.8222400Z sync-tag: 2024-01-31T22:06:28.8223185Z timeout-minutes: 300 2024-01-31T22:06:28.8223593Z use-gha: 2024-01-31T22:06:28.8223962Z dashboard-tag: 2024-01-31T22:06:28.8224346Z ##[endgroup] 2024-01-31T22:06:28.8225393Z Complete job name: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:06:28.9062713Z ##[group]Run pytorch/test-infra/.github/actions/setup-ssh@main 2024-01-31T22:06:28.9063345Z with: 2024-01-31T22:06:28.9064003Z github-secret: *** 2024-01-31T22:06:28.9065010Z instructions: All testing is done inside the container, to start an interactive session run: docker exec -it $(docker container ps --format '{{.ID}}') bash 2024-01-31T22:06:28.9066130Z activate-with-label: false 2024-01-31T22:06:28.9066571Z label: with-ssh 2024-01-31T22:06:28.9066967Z remove-existing-keys: true 2024-01-31T22:06:28.9067399Z env: 2024-01-31T22:06:28.9067755Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:06:28.9068173Z ##[endgroup] 2024-01-31T22:06:28.9989123Z Please see https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-for-Github-Actions for more info. 2024-01-31T22:06:28.9990237Z ciflow reference detected, attempting to extract PR number 2024-01-31T22:06:29.2434449Z Grabbing public ssh keys from https://github.com/pytorch-bot[bot].keys 2024-01-31T22:06:29.3144143Z No SSH keys found for user pytorch-bot[bot] 2024-01-31T22:06:29.3144878Z Grabbing public ssh keys from https://github.com/clee2000.keys 2024-01-31T22:06:29.3783088Z ~/.ssh/authorized_keys file found on node, removing ~/.ssh and starting fresh 2024-01-31T22:06:29.3797411Z Public keys pulled and installed to /home/ec2-user/.ssh/authorized_keys 2024-01-31T22:06:29.3821862Z Login using: ssh ec2-user@ec2-3-92-87-55.compute-1.amazonaws.com 2024-01-31T22:06:29.3822720Z All testing is done inside the container, to start an interactive session run: 2024-01-31T22:06:29.3823609Z docker exec -it $(docker container ps --format '{{.ID}}') bash 2024-01-31T22:06:29.4062369Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main 2024-01-31T22:06:29.4062928Z with: 2024-01-31T22:06:29.4063197Z submodules: recursive 2024-01-31T22:06:29.4063532Z fetch-depth: 0 2024-01-31T22:06:29.4063833Z env: 2024-01-31T22:06:29.4064098Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:06:29.4064440Z ##[endgroup] 2024-01-31T22:06:29.4278751Z ##[group]Run retry () { 2024-01-31T22:06:29.4279181Z retry () { 2024-01-31T22:06:29.4279689Z  $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*) 2024-01-31T22:06:29.4280243Z } 2024-01-31T22:06:29.4280547Z echo "${GITHUB_WORKSPACE}" 2024-01-31T22:06:29.4280955Z if [ -z "${NO_SUDO}" ]; then 2024-01-31T22:06:29.4281427Z  retry sudo rm -rf "${GITHUB_WORKSPACE}" 2024-01-31T22:06:29.4281901Z else 2024-01-31T22:06:29.4282236Z  retry rm -rf "${GITHUB_WORKSPACE}" 2024-01-31T22:06:29.4282662Z fi 2024-01-31T22:06:29.4283012Z mkdir "${GITHUB_WORKSPACE}" 2024-01-31T22:06:29.4305956Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:06:29.4306464Z env: 2024-01-31T22:06:29.4306802Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:06:29.4307209Z NO_SUDO: 2024-01-31T22:06:29.4307549Z ##[endgroup] 2024-01-31T22:06:29.4402228Z /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-01-31T22:06:32.1812956Z ##[group]Run malfet/checkout@silent-checkout 2024-01-31T22:06:32.1813423Z with: 2024-01-31T22:06:32.1813751Z ref: 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:06:32.1814193Z fetch-depth: 0 2024-01-31T22:06:32.1814493Z submodules: recursive 2024-01-31T22:06:32.1814832Z quiet-checkout: true 2024-01-31T22:06:32.1815179Z repository: pytorch/pytorch 2024-01-31T22:06:32.1815666Z token: *** 2024-01-31T22:06:32.1815950Z ssh-strict: true 2024-01-31T22:06:32.1816275Z persist-credentials: true 2024-01-31T22:06:32.1816630Z clean: true 2024-01-31T22:06:32.1816953Z sparse-checkout-cone-mode: true 2024-01-31T22:06:32.1817339Z lfs: false 2024-01-31T22:06:32.1817638Z set-safe-directory: true 2024-01-31T22:06:32.1817966Z env: 2024-01-31T22:06:32.1818239Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:06:32.1818592Z ##[endgroup] 2024-01-31T22:06:32.2742721Z Syncing repository: pytorch/pytorch 2024-01-31T22:06:32.2744290Z ##[group]Getting Git version info 2024-01-31T22:06:32.2745182Z Working directory is '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2024-01-31T22:06:32.2746191Z [command]/usr/bin/git version 2024-01-31T22:06:32.2746555Z git version 2.40.1 2024-01-31T22:06:32.2747909Z ##[endgroup] 2024-01-31T22:06:32.2759171Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/5e1ddec6-987d-45b2-80a2-914bbbe9bce7' before making global git config changes 2024-01-31T22:06:32.2760437Z Adding repository directory to the temporary git global config as a safe directory 2024-01-31T22:06:32.2761564Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-01-31T22:06:32.2788966Z Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2024-01-31T22:06:32.2792708Z ##[group]Initializing the repository 2024-01-31T22:06:32.2795628Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-01-31T22:06:32.2819693Z hint: Using 'master' as the name for the initial branch. This default branch name 2024-01-31T22:06:32.2820554Z hint: is subject to change. To configure the initial branch name to use in all 2024-01-31T22:06:32.2821366Z hint: of your new repositories, which will suppress this warning, call: 2024-01-31T22:06:32.2821922Z hint: 2024-01-31T22:06:32.2822529Z hint: git config --global init.defaultBranch 2024-01-31T22:06:32.2823012Z hint: 2024-01-31T22:06:32.2823535Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2024-01-31T22:06:32.2824393Z hint: 'development'. The just-created branch can be renamed via this command: 2024-01-31T22:06:32.2825142Z hint: 2024-01-31T22:06:32.2825466Z hint: git branch -m 2024-01-31T22:06:32.2826248Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/ 2024-01-31T22:06:32.2828699Z [command]/usr/bin/git remote add origin https://github.com/pytorch/pytorch 2024-01-31T22:06:32.2849204Z ##[endgroup] 2024-01-31T22:06:32.2849795Z ##[group]Disabling automatic garbage collection 2024-01-31T22:06:32.2851962Z [command]/usr/bin/git config --local gc.auto 0 2024-01-31T22:06:32.2876338Z ##[endgroup] 2024-01-31T22:06:32.2877149Z ##[group]Setting up auth 2024-01-31T22:06:32.2883336Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2024-01-31T22:06:32.2905304Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2024-01-31T22:06:32.3102557Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2024-01-31T22:06:32.3125332Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2024-01-31T22:06:32.3325756Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2024-01-31T22:06:32.3378284Z ##[endgroup] 2024-01-31T22:06:32.3378838Z ##[group]Fetching the repository 2024-01-31T22:06:32.3384474Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --progress --no-recurse-submodules --quiet origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2024-01-31T22:06:43.1753084Z remote: Enumerating objects: 1248854 2024-01-31T22:06:43.1753665Z remote: Enumerating objects: 1251433, done. 2024-01-31T22:06:43.1755184Z remote: Counting objects: 0% (1/2579) 2024-01-31T22:06:43.1755706Z remote: Counting objects: 1% (26/2579) 2024-01-31T22:06:43.1756354Z remote: Counting objects: 2% (52/2579) 2024-01-31T22:06:43.1758619Z remote: Counting objects: 3% (78/2579) 2024-01-31T22:06:43.1759200Z remote: Counting objects: 4% (104/2579) 2024-01-31T22:06:43.1759800Z remote: Counting objects: 5% (129/2579) 2024-01-31T22:06:43.1760311Z remote: Counting objects: 6% (155/2579) 2024-01-31T22:06:43.1760837Z remote: Counting objects: 7% (181/2579) 2024-01-31T22:06:43.1761629Z remote: Counting objects: 8% (207/2579) 2024-01-31T22:06:43.1763097Z remote: Counting objects: 9% (233/2579) 2024-01-31T22:06:43.1763693Z remote: Counting objects: 10% (258/2579) 2024-01-31T22:06:43.1764455Z remote: Counting objects: 11% (284/2579) 2024-01-31T22:06:43.1765242Z remote: Counting objects: 12% (310/2579) 2024-01-31T22:06:43.1766396Z remote: Counting objects: 13% (336/2579) 2024-01-31T22:06:43.1769424Z remote: Counting objects: 14% (362/2579) 2024-01-31T22:06:43.1770180Z remote: Counting objects: 15% (387/2579) 2024-01-31T22:06:43.1770923Z remote: Counting objects: 16% (413/2579) 2024-01-31T22:06:43.1771674Z remote: Counting objects: 17% (439/2579) 2024-01-31T22:06:43.1772891Z remote: Counting objects: 18% (465/2579) 2024-01-31T22:06:43.1773604Z remote: Counting objects: 19% (491/2579) 2024-01-31T22:06:43.1774309Z remote: Counting objects: 20% (516/2579) 2024-01-31T22:06:43.1774808Z remote: Counting objects: 21% (542/2579) 2024-01-31T22:06:43.1775313Z remote: Counting objects: 22% (568/2579) 2024-01-31T22:06:43.1775813Z remote: Counting objects: 23% (594/2579) 2024-01-31T22:06:43.1776632Z remote: Counting objects: 24% (619/2579) 2024-01-31T22:06:43.1777306Z remote: Counting objects: 25% (645/2579) 2024-01-31T22:06:43.1777807Z remote: Counting objects: 26% (671/2579) 2024-01-31T22:06:43.1778314Z remote: Counting objects: 27% (697/2579) 2024-01-31T22:06:43.1778957Z remote: Counting objects: 28% (723/2579) 2024-01-31T22:06:43.1779760Z remote: Counting objects: 29% (748/2579) 2024-01-31T22:06:43.1780402Z remote: Counting objects: 30% (774/2579) 2024-01-31T22:06:43.1780927Z remote: Counting objects: 31% (800/2579) 2024-01-31T22:06:43.1781428Z remote: Counting objects: 32% (826/2579) 2024-01-31T22:06:43.1781936Z remote: Counting objects: 33% (852/2579) 2024-01-31T22:06:43.1782452Z remote: Counting objects: 34% (877/2579) 2024-01-31T22:06:43.1782941Z remote: Counting objects: 35% (903/2579) 2024-01-31T22:06:43.1783444Z remote: Counting objects: 36% (929/2579) 2024-01-31T22:06:43.1784326Z remote: Counting objects: 37% (955/2579) 2024-01-31T22:06:43.1784840Z remote: Counting objects: 38% (981/2579) 2024-01-31T22:06:43.1785358Z remote: Counting objects: 39% (1006/2579) 2024-01-31T22:06:43.1785899Z remote: Counting objects: 40% (1032/2579) 2024-01-31T22:06:43.1786436Z remote: Counting objects: 41% (1058/2579) 2024-01-31T22:06:43.1789371Z remote: Counting objects: 42% (1084/2579) 2024-01-31T22:06:43.1789901Z remote: Counting objects: 43% (1109/2579) 2024-01-31T22:06:43.1790433Z remote: Counting objects: 44% (1135/2579) 2024-01-31T22:06:43.1790953Z remote: Counting objects: 45% (1161/2579) 2024-01-31T22:06:43.1793896Z remote: Counting objects: 46% (1187/2579) 2024-01-31T22:06:43.1794465Z remote: Counting objects: 47% (1213/2579) 2024-01-31T22:06:43.1801552Z remote: Counting objects: 48% (1238/2579) 2024-01-31T22:06:43.1802084Z remote: Counting objects: 49% (1264/2579) 2024-01-31T22:06:43.1802606Z remote: Counting objects: 50% (1290/2579) 2024-01-31T22:06:43.1803130Z remote: Counting objects: 51% (1316/2579) 2024-01-31T22:06:43.1803650Z remote: Counting objects: 52% (1342/2579) 2024-01-31T22:06:43.1804186Z remote: Counting objects: 53% (1367/2579) 2024-01-31T22:06:43.1804718Z remote: Counting objects: 54% (1393/2579) 2024-01-31T22:06:43.1805459Z remote: Counting objects: 55% (1419/2579) 2024-01-31T22:06:43.1805974Z remote: Counting objects: 56% (1445/2579) 2024-01-31T22:06:43.1806503Z remote: Counting objects: 57% (1471/2579) 2024-01-31T22:06:43.1807027Z remote: Counting objects: 58% (1496/2579) 2024-01-31T22:06:43.1807542Z remote: Counting objects: 59% (1522/2579) 2024-01-31T22:06:43.1808066Z remote: Counting objects: 60% (1548/2579) 2024-01-31T22:06:43.1808585Z remote: Counting objects: 61% (1574/2579) 2024-01-31T22:06:43.1809099Z remote: Counting objects: 62% (1599/2579) 2024-01-31T22:06:43.1809941Z remote: Counting objects: 63% (1625/2579) 2024-01-31T22:06:43.1810466Z remote: Counting objects: 64% (1651/2579) 2024-01-31T22:06:43.1811149Z remote: Counting objects: 65% (1677/2579) 2024-01-31T22:06:43.1811672Z remote: Counting objects: 66% (1703/2579) 2024-01-31T22:06:43.1812197Z remote: Counting objects: 67% (1728/2579) 2024-01-31T22:06:43.1812720Z remote: Counting objects: 68% (1754/2579) 2024-01-31T22:06:43.1813242Z remote: Counting objects: 69% (1780/2579) 2024-01-31T22:06:43.1813886Z remote: Counting objects: 70% (1806/2579) 2024-01-31T22:06:43.1814409Z remote: Counting objects: 71% (1832/2579) 2024-01-31T22:06:43.1814945Z remote: Counting objects: 72% (1857/2579) 2024-01-31T22:06:43.1815469Z remote: Counting objects: 73% (1883/2579) 2024-01-31T22:06:43.1815996Z remote: Counting objects: 74% (1909/2579) 2024-01-31T22:06:43.1816521Z remote: Counting objects: 75% (1935/2579) 2024-01-31T22:06:43.1817037Z remote: Counting objects: 76% (1961/2579) 2024-01-31T22:06:43.1817560Z remote: Counting objects: 77% (1986/2579) 2024-01-31T22:06:43.1818080Z remote: Counting objects: 78% (2012/2579) 2024-01-31T22:06:43.1818661Z remote: Counting objects: 79% (2038/2579) 2024-01-31T22:06:43.1819181Z remote: Counting objects: 80% (2064/2579) 2024-01-31T22:06:43.1819702Z remote: Counting objects: 81% (2089/2579) 2024-01-31T22:06:43.1820232Z remote: Counting objects: 82% (2115/2579) 2024-01-31T22:06:43.1820758Z remote: Counting objects: 83% (2141/2579) 2024-01-31T22:06:43.1821280Z remote: Counting objects: 84% (2167/2579) 2024-01-31T22:06:43.1821796Z remote: Counting objects: 85% (2193/2579) 2024-01-31T22:06:43.1822317Z remote: Counting objects: 86% (2218/2579) 2024-01-31T22:06:43.1822839Z remote: Counting objects: 87% (2244/2579) 2024-01-31T22:06:43.1823356Z remote: Counting objects: 88% (2270/2579) 2024-01-31T22:06:43.1823877Z remote: Counting objects: 89% (2296/2579) 2024-01-31T22:06:43.1824398Z remote: Counting objects: 90% (2322/2579) 2024-01-31T22:06:43.1824917Z remote: Counting objects: 91% (2347/2579) 2024-01-31T22:06:43.1825448Z remote: Counting objects: 92% (2373/2579) 2024-01-31T22:06:43.1825974Z remote: Counting objects: 93% (2399/2579) 2024-01-31T22:06:43.1826497Z remote: Counting objects: 94% (2425/2579) 2024-01-31T22:06:43.1827018Z remote: Counting objects: 95% (2451/2579) 2024-01-31T22:06:43.1827539Z remote: Counting objects: 96% (2476/2579) 2024-01-31T22:06:43.1828063Z remote: Counting objects: 97% (2502/2579) 2024-01-31T22:06:43.1828581Z remote: Counting objects: 98% (2528/2579) 2024-01-31T22:06:43.1829143Z remote: Counting objects: 99% (2554/2579) 2024-01-31T22:06:43.1829676Z remote: Counting objects: 100% (2579/2579) 2024-01-31T22:06:43.1830235Z remote: Counting objects: 100% (2579/2579), done. 2024-01-31T22:06:43.1924135Z remote: Compressing objects: 0% (1/1079) 2024-01-31T22:06:43.1985234Z remote: Compressing objects: 1% (11/1079) 2024-01-31T22:06:43.2197172Z remote: Compressing objects: 2% (22/1079) 2024-01-31T22:06:43.2860869Z remote: Compressing objects: 3% (33/1079) 2024-01-31T22:06:43.3408716Z remote: Compressing objects: 4% (44/1079) 2024-01-31T22:06:43.3758778Z remote: Compressing objects: 5% (54/1079) 2024-01-31T22:06:43.4188316Z remote: Compressing objects: 6% (65/1079) 2024-01-31T22:06:43.4515367Z remote: Compressing objects: 7% (76/1079) 2024-01-31T22:06:43.4875678Z remote: Compressing objects: 8% (87/1079) 2024-01-31T22:06:43.4919631Z remote: Compressing objects: 9% (98/1079) 2024-01-31T22:06:43.5116554Z remote: Compressing objects: 10% (108/1079) 2024-01-31T22:06:43.5202190Z remote: Compressing objects: 11% (119/1079) 2024-01-31T22:06:43.5288790Z remote: Compressing objects: 12% (130/1079) 2024-01-31T22:06:43.5378567Z remote: Compressing objects: 13% (141/1079) 2024-01-31T22:06:43.5420527Z remote: Compressing objects: 14% (152/1079) 2024-01-31T22:06:43.5474296Z remote: Compressing objects: 15% (162/1079) 2024-01-31T22:06:43.5505051Z remote: Compressing objects: 16% (173/1079) 2024-01-31T22:06:43.5528313Z remote: Compressing objects: 17% (184/1079) 2024-01-31T22:06:43.5535264Z remote: Compressing objects: 18% (195/1079) 2024-01-31T22:06:43.5543307Z remote: Compressing objects: 19% (206/1079) 2024-01-31T22:06:43.5556802Z remote: Compressing objects: 20% (216/1079) 2024-01-31T22:06:43.5562555Z remote: Compressing objects: 21% (227/1079) 2024-01-31T22:06:43.5569931Z remote: Compressing objects: 22% (238/1079) 2024-01-31T22:06:43.5577520Z remote: Compressing objects: 23% (249/1079) 2024-01-31T22:06:43.5587205Z remote: Compressing objects: 24% (259/1079) 2024-01-31T22:06:43.5587744Z remote: Compressing objects: 25% (270/1079) 2024-01-31T22:06:43.5592166Z remote: Compressing objects: 26% (281/1079) 2024-01-31T22:06:43.5600460Z remote: Compressing objects: 27% (292/1079) 2024-01-31T22:06:43.5617294Z remote: Compressing objects: 28% (303/1079) 2024-01-31T22:06:43.5617840Z remote: Compressing objects: 29% (313/1079) 2024-01-31T22:06:43.5628311Z remote: Compressing objects: 30% (324/1079) 2024-01-31T22:06:43.5633996Z remote: Compressing objects: 31% (335/1079) 2024-01-31T22:06:43.5652474Z remote: Compressing objects: 32% (346/1079) 2024-01-31T22:06:43.5675899Z remote: Compressing objects: 33% (357/1079) 2024-01-31T22:06:43.5695727Z remote: Compressing objects: 34% (367/1079) 2024-01-31T22:06:43.5709623Z remote: Compressing objects: 35% (378/1079) 2024-01-31T22:06:43.5718474Z remote: Compressing objects: 36% (389/1079) 2024-01-31T22:06:43.5724377Z remote: Compressing objects: 37% (400/1079) 2024-01-31T22:06:43.5727342Z remote: Compressing objects: 38% (411/1079) 2024-01-31T22:06:43.5737483Z remote: Compressing objects: 39% (421/1079) 2024-01-31T22:06:43.5744610Z remote: Compressing objects: 40% (432/1079) 2024-01-31T22:06:43.5756497Z remote: Compressing objects: 41% (443/1079) 2024-01-31T22:06:43.5765816Z remote: Compressing objects: 42% (454/1079) 2024-01-31T22:06:43.5772783Z remote: Compressing objects: 43% (464/1079) 2024-01-31T22:06:43.5776184Z remote: Compressing objects: 44% (475/1079) 2024-01-31T22:06:43.5782710Z remote: Compressing objects: 45% (486/1079) 2024-01-31T22:06:43.5783259Z remote: Compressing objects: 46% (497/1079) 2024-01-31T22:06:43.5789837Z remote: Compressing objects: 47% (508/1079) 2024-01-31T22:06:43.5790381Z remote: Compressing objects: 48% (518/1079) 2024-01-31T22:06:43.5811311Z remote: Compressing objects: 49% (529/1079) 2024-01-31T22:06:43.5811854Z remote: Compressing objects: 50% (540/1079) 2024-01-31T22:06:43.5826332Z remote: Compressing objects: 51% (551/1079) 2024-01-31T22:06:43.5826896Z remote: Compressing objects: 52% (562/1079) 2024-01-31T22:06:43.5827437Z remote: Compressing objects: 53% (572/1079) 2024-01-31T22:06:43.5827982Z remote: Compressing objects: 54% (583/1079) 2024-01-31T22:06:43.5828528Z remote: Compressing objects: 55% (594/1079) 2024-01-31T22:06:43.5829059Z remote: Compressing objects: 56% (605/1079) 2024-01-31T22:06:43.5829602Z remote: Compressing objects: 57% (616/1079) 2024-01-31T22:06:43.5830141Z remote: Compressing objects: 58% (626/1079) 2024-01-31T22:06:43.5830678Z remote: Compressing objects: 59% (637/1079) 2024-01-31T22:06:43.5831224Z remote: Compressing objects: 60% (648/1079) 2024-01-31T22:06:43.5831766Z remote: Compressing objects: 61% (659/1079) 2024-01-31T22:06:43.5832291Z remote: Compressing objects: 62% (669/1079) 2024-01-31T22:06:43.5832835Z remote: Compressing objects: 63% (680/1079) 2024-01-31T22:06:43.5839700Z remote: Compressing objects: 64% (691/1079) 2024-01-31T22:06:43.5853444Z remote: Compressing objects: 65% (702/1079) 2024-01-31T22:06:43.5869131Z remote: Compressing objects: 66% (713/1079) 2024-01-31T22:06:43.5869687Z remote: Compressing objects: 67% (723/1079) 2024-01-31T22:06:43.5888758Z remote: Compressing objects: 68% (734/1079) 2024-01-31T22:06:43.5889314Z remote: Compressing objects: 69% (745/1079) 2024-01-31T22:06:43.5898462Z remote: Compressing objects: 70% (756/1079) 2024-01-31T22:06:43.5899022Z remote: Compressing objects: 71% (767/1079) 2024-01-31T22:06:43.5899569Z remote: Compressing objects: 72% (777/1079) 2024-01-31T22:06:43.5900110Z remote: Compressing objects: 73% (788/1079) 2024-01-31T22:06:43.5900655Z remote: Compressing objects: 74% (799/1079) 2024-01-31T22:06:43.5912653Z remote: Compressing objects: 75% (810/1079) 2024-01-31T22:06:43.5913186Z remote: Compressing objects: 76% (821/1079) 2024-01-31T22:06:43.5926987Z remote: Compressing objects: 77% (831/1079) 2024-01-31T22:06:43.5927548Z remote: Compressing objects: 78% (842/1079) 2024-01-31T22:06:43.5928078Z remote: Compressing objects: 79% (853/1079) 2024-01-31T22:06:43.5928612Z remote: Compressing objects: 80% (864/1079) 2024-01-31T22:06:43.5929167Z remote: Compressing objects: 81% (874/1079) 2024-01-31T22:06:43.5929700Z remote: Compressing objects: 82% (885/1079) 2024-01-31T22:06:43.5941334Z remote: Compressing objects: 83% (896/1079) 2024-01-31T22:06:43.5941898Z remote: Compressing objects: 84% (907/1079) 2024-01-31T22:06:43.5942444Z remote: Compressing objects: 85% (918/1079) 2024-01-31T22:06:43.5942983Z remote: Compressing objects: 86% (928/1079) 2024-01-31T22:06:43.5943527Z remote: Compressing objects: 87% (939/1079) 2024-01-31T22:06:43.5944071Z remote: Compressing objects: 88% (950/1079) 2024-01-31T22:06:43.5944608Z remote: Compressing objects: 89% (961/1079) 2024-01-31T22:06:43.5945163Z remote: Compressing objects: 90% (972/1079) 2024-01-31T22:06:43.5980835Z remote: Compressing objects: 91% (982/1079) 2024-01-31T22:06:43.5981480Z remote: Compressing objects: 92% (993/1079) 2024-01-31T22:06:43.5982365Z remote: Compressing objects: 93% (1004/1079) 2024-01-31T22:06:43.5982945Z remote: Compressing objects: 94% (1015/1079) 2024-01-31T22:06:43.5983494Z remote: Compressing objects: 95% (1026/1079) 2024-01-31T22:06:43.5984034Z remote: Compressing objects: 96% (1036/1079) 2024-01-31T22:06:43.5984759Z remote: Compressing objects: 97% (1047/1079) 2024-01-31T22:06:43.5985450Z remote: Compressing objects: 98% (1058/1079) 2024-01-31T22:06:43.5986122Z remote: Compressing objects: 99% (1069/1079) 2024-01-31T22:06:43.5986663Z remote: Compressing objects: 100% (1079/1079) 2024-01-31T22:06:43.5987248Z remote: Compressing objects: 100% (1079/1079), done. 2024-01-31T22:07:09.9824916Z remote: Total 1251433 (delta 1825), reused 2118 (delta 1497), pack-reused 1248854 2024-01-31T22:07:35.9033309Z [command]/usr/bin/git rev-parse --verify --quiet 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528^{object} 2024-01-31T22:07:35.9056022Z 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:07:35.9060294Z ##[endgroup] 2024-01-31T22:07:35.9060858Z ##[group]Determining the checkout info 2024-01-31T22:07:35.9061678Z ##[endgroup] 2024-01-31T22:07:35.9062217Z ##[group]Checking out the ref 2024-01-31T22:07:35.9065236Z [command]/usr/bin/git checkout --quiet --force 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:07:37.0840159Z ##[endgroup] 2024-01-31T22:07:37.0840758Z ##[group]Setting up auth for fetching submodules 2024-01-31T22:07:37.0845511Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2024-01-31T22:07:37.0885325Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2024-01-31T22:07:37.0911268Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2024-01-31T22:07:37.0931490Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2024-01-31T22:07:37.0947994Z ##[endgroup] 2024-01-31T22:07:37.0948720Z ##[group]Fetching submodules 2024-01-31T22:07:37.0951877Z [command]/usr/bin/git submodule sync --recursive 2024-01-31T22:07:37.1156807Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2024-01-31T22:07:37.1363554Z Submodule 'android/libs/fbjni' (https://github.com/facebookincubator/fbjni.git) registered for path 'android/libs/fbjni' 2024-01-31T22:07:37.1365417Z Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/FP16' 2024-01-31T22:07:37.1367010Z Submodule 'third_party/NNPACK_deps/FXdiv' (https://github.com/Maratyszcza/FXdiv.git) registered for path 'third_party/FXdiv' 2024-01-31T22:07:37.1368760Z Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git) registered for path 'third_party/NNPACK' 2024-01-31T22:07:37.1370120Z Submodule 'third_party/QNNPACK' (https://github.com/pytorch/QNNPACK) registered for path 'third_party/QNNPACK' 2024-01-31T22:07:37.1371936Z Submodule 'third_party/VulkanMemoryAllocator' (https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git) registered for path 'third_party/VulkanMemoryAllocator' 2024-01-31T22:07:37.1373777Z Submodule 'third_party/XNNPACK' (https://github.com/google/XNNPACK.git) registered for path 'third_party/XNNPACK' 2024-01-31T22:07:37.1375401Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark' 2024-01-31T22:07:37.1377825Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo.git) registered for path 'third_party/cpuinfo' 2024-01-31T22:07:37.1379608Z Submodule 'third_party/cub' (https://github.com/NVlabs/cub.git) registered for path 'third_party/cub' 2024-01-31T22:07:37.1382423Z Submodule 'third_party/cudnn_frontend' (https://github.com/NVIDIA/cudnn-frontend.git) registered for path 'third_party/cudnn_frontend' 2024-01-31T22:07:37.1384388Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/cutlass' 2024-01-31T22:07:37.1387034Z Submodule 'third_party/eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'third_party/eigen' 2024-01-31T22:07:37.1389503Z Submodule 'third_party/fbgemm' (https://github.com/pytorch/fbgemm) registered for path 'third_party/fbgemm' 2024-01-31T22:07:37.1392161Z Submodule 'third_party/flatbuffers' (https://github.com/google/flatbuffers.git) registered for path 'third_party/flatbuffers' 2024-01-31T22:07:37.1394638Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/fmt' 2024-01-31T22:07:37.1397294Z Submodule 'third_party/foxi' (https://github.com/houseroad/foxi.git) registered for path 'third_party/foxi' 2024-01-31T22:07:37.1400251Z Submodule 'third_party/gemmlowp/gemmlowp' (https://github.com/google/gemmlowp.git) registered for path 'third_party/gemmlowp/gemmlowp' 2024-01-31T22:07:37.1402958Z Submodule 'third_party/gloo' (https://github.com/facebookincubator/gloo) registered for path 'third_party/gloo' 2024-01-31T22:07:37.1406247Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest' 2024-01-31T22:07:37.1409256Z Submodule 'third_party/ideep' (https://github.com/intel/ideep) registered for path 'third_party/ideep' 2024-01-31T22:07:37.1412344Z Submodule 'third_party/ios-cmake' (https://github.com/Yangqing/ios-cmake.git) registered for path 'third_party/ios-cmake' 2024-01-31T22:07:37.1415385Z Submodule 'third_party/ittapi' (https://github.com/intel/ittapi.git) registered for path 'third_party/ittapi' 2024-01-31T22:07:37.1418480Z Submodule 'third_party/kineto' (https://github.com/pytorch/kineto) registered for path 'third_party/kineto' 2024-01-31T22:07:37.1421735Z Submodule 'third_party/mimalloc' (https://github.com/microsoft/mimalloc.git) registered for path 'third_party/mimalloc' 2024-01-31T22:07:37.1424922Z Submodule 'third_party/nccl/nccl' (https://github.com/NVIDIA/nccl) registered for path 'third_party/nccl/nccl' 2024-01-31T22:07:37.1428292Z Submodule 'third_party/neon2sse' (https://github.com/intel/ARM_NEON_2_x86_SSE.git) registered for path 'third_party/neon2sse' 2024-01-31T22:07:37.1431669Z Submodule 'third_party/nlohmann' (https://github.com/nlohmann/json.git) registered for path 'third_party/nlohmann' 2024-01-31T22:07:37.1434995Z Submodule 'third_party/onnx' (https://github.com/onnx/onnx.git) registered for path 'third_party/onnx' 2024-01-31T22:07:37.1438602Z Submodule 'third_party/onnx-tensorrt' (https://github.com/onnx/onnx-tensorrt) registered for path 'third_party/onnx-tensorrt' 2024-01-31T22:07:37.1442145Z Submodule 'third_party/pocketfft' (https://github.com/mreineck/pocketfft) registered for path 'third_party/pocketfft' 2024-01-31T22:07:37.1446212Z Submodule 'third_party/protobuf' (https://github.com/protocolbuffers/protobuf.git) registered for path 'third_party/protobuf' 2024-01-31T22:07:37.1449972Z Submodule 'third_party/NNPACK_deps/psimd' (https://github.com/Maratyszcza/psimd.git) registered for path 'third_party/psimd' 2024-01-31T22:07:37.1453720Z Submodule 'third_party/NNPACK_deps/pthreadpool' (https://github.com/Maratyszcza/pthreadpool.git) registered for path 'third_party/pthreadpool' 2024-01-31T22:07:37.1457384Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11' 2024-01-31T22:07:37.1461342Z Submodule 'third_party/python-peachpy' (https://github.com/malfet/PeachPy.git) registered for path 'third_party/python-peachpy' 2024-01-31T22:07:37.1465122Z Submodule 'third_party/sleef' (https://github.com/shibatch/sleef) registered for path 'third_party/sleef' 2024-01-31T22:07:37.1469058Z Submodule 'third_party/tbb' (https://github.com/01org/tbb) registered for path 'third_party/tbb' 2024-01-31T22:07:37.1473360Z Submodule 'third_party/tensorpipe' (https://github.com/pytorch/tensorpipe.git) registered for path 'third_party/tensorpipe' 2024-01-31T22:07:37.1477475Z Submodule 'third_party/zstd' (https://github.com/facebook/zstd.git) registered for path 'third_party/zstd' 2024-01-31T22:07:37.1498118Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/android/libs/fbjni'... 2024-01-31T22:07:37.4254058Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FP16'... 2024-01-31T22:07:37.6002755Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FXdiv'... 2024-01-31T22:07:37.7864745Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NNPACK'... 2024-01-31T22:07:38.0086526Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/QNNPACK'... 2024-01-31T22:07:38.2611965Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/VulkanMemoryAllocator'... 2024-01-31T22:07:40.0966208Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/XNNPACK'... 2024-01-31T22:07:48.6327240Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/benchmark'... 2024-01-31T22:07:49.0022587Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpuinfo'... 2024-01-31T22:07:49.5663114Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cub'... 2024-01-31T22:07:50.8314897Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cudnn_frontend'... 2024-01-31T22:07:51.9884379Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cutlass'... 2024-01-31T22:07:54.4459250Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/eigen'... 2024-01-31T22:08:00.2491115Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm'... 2024-01-31T22:08:01.5593626Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flatbuffers'... 2024-01-31T22:08:02.9757635Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fmt'... 2024-01-31T22:08:04.1118116Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/foxi'... 2024-01-31T22:08:04.2715851Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gemmlowp/gemmlowp'... 2024-01-31T22:08:04.8749595Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gloo'... 2024-01-31T22:08:05.1952368Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/googletest'... 2024-01-31T22:08:06.2022254Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep'... 2024-01-31T22:08:06.5704466Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ios-cmake'... 2024-01-31T22:08:06.7599171Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ittapi'... 2024-01-31T22:08:06.9824356Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto'... 2024-01-31T22:08:08.5866600Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/mimalloc'... 2024-01-31T22:08:09.1592970Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nccl/nccl'... 2024-01-31T22:08:09.7643370Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/neon2sse'... 2024-01-31T22:08:10.0622824Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nlohmann'... 2024-01-31T22:08:17.0115649Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx'... 2024-01-31T22:08:19.4817168Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx-tensorrt'... 2024-01-31T22:08:19.9304014Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pocketfft'... 2024-01-31T22:08:20.1402697Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf'... 2024-01-31T22:08:27.2940914Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/psimd'... 2024-01-31T22:08:27.6822523Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pthreadpool'... 2024-01-31T22:08:27.9002820Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pybind11'... 2024-01-31T22:08:28.7129759Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/python-peachpy'... 2024-01-31T22:08:29.0010269Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/sleef'... 2024-01-31T22:08:29.4889005Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tbb'... 2024-01-31T22:08:32.0820545Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe'... 2024-01-31T22:08:32.4834330Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/zstd'... 2024-01-31T22:08:35.1266079Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2024-01-31T22:08:35.1345687Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2024-01-31T22:08:35.1408239Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2024-01-31T22:08:35.1590830Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2024-01-31T22:08:35.1779482Z Submodule path 'third_party/QNNPACK': checked out '7d2a4e9931a82adc3814275b6219a03e24e36b4c' 2024-01-31T22:08:35.2080708Z Submodule path 'third_party/VulkanMemoryAllocator': checked out 'a6bfc237255a6bac1513f7c1ebde6d8aed6b5191' 2024-01-31T22:08:36.0139573Z Submodule path 'third_party/XNNPACK': checked out 'd9cce341f86a207da9d851d05e26cd50b508b73c' 2024-01-31T22:08:36.0306051Z Submodule path 'third_party/benchmark': checked out '0d98dba29d66e93259db7daa53a9327df767a415' 2024-01-31T22:08:36.1113544Z Submodule path 'third_party/cpuinfo': checked out 'd6860c477c99f1fce9e28eb206891af3c0e1a1d7' 2024-01-31T22:08:36.1391819Z Submodule path 'third_party/cub': checked out 'd106ddb991a56c3df1b6d51b2409e36ba8181ce4' 2024-01-31T22:08:36.1614779Z Submodule path 'third_party/cudnn_frontend': checked out '9f82dda5c029d15a5f371f0fe003dc0c74a0c987' 2024-01-31T22:08:36.5703848Z Submodule path 'third_party/cutlass': checked out '44c704eae85da352d277d6f092f41412772f70e4' 2024-01-31T22:08:36.7759956Z Submodule path 'third_party/eigen': checked out '3147391d946bb4b6c68edd901f2add6ac1f31f8c' 2024-01-31T22:08:36.8292677Z Submodule path 'third_party/fbgemm': checked out 'dbc3157bf256f1339b3fa1fef2be89ac4078be0e' 2024-01-31T22:08:36.8305172Z Submodule 'third_party/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'third_party/fbgemm/third_party/asmjit' 2024-01-31T22:08:36.8306817Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'third_party/fbgemm/third_party/cpuinfo' 2024-01-31T22:08:36.8308965Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/fbgemm/third_party/cutlass' 2024-01-31T22:08:36.8310869Z Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'third_party/fbgemm/third_party/googletest' 2024-01-31T22:08:36.8319109Z Submodule 'third_party/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'third_party/fbgemm/third_party/hipify_torch' 2024-01-31T22:08:36.8331762Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/asmjit'... 2024-01-31T22:08:37.9659310Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/cpuinfo'... 2024-01-31T22:08:38.7861180Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/cutlass'... 2024-01-31T22:08:41.5802829Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/googletest'... 2024-01-31T22:08:42.9757229Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/hipify_torch'... 2024-01-31T22:08:43.5177842Z Submodule path 'third_party/fbgemm/third_party/asmjit': checked out 'd3fbf7c9bc7c1d1365a94a45614b91c5a3706b81' 2024-01-31T22:08:43.6006602Z Submodule path 'third_party/fbgemm/third_party/cpuinfo': checked out 'ed8b86a253800bafdb7b25c5c399f91bff9cb1f3' 2024-01-31T22:08:43.9423265Z Submodule path 'third_party/fbgemm/third_party/cutlass': checked out 'fc9ebc645b63f3a6bc80aaefde5c063fb72110d6' 2024-01-31T22:08:43.9946839Z Submodule path 'third_party/fbgemm/third_party/googletest': checked out 'cbf019de22c8dd37b2108da35b2748fd702d1796' 2024-01-31T22:08:44.0024604Z Submodule path 'third_party/fbgemm/third_party/hipify_torch': checked out '23f53b025b466d8ec3c45d52290d3442f7fbe6b1' 2024-01-31T22:08:44.0925134Z Submodule path 'third_party/flatbuffers': checked out '01834de25e4bf3975a9a00e816292b1ad0fe184b' 2024-01-31T22:08:44.1213156Z Submodule path 'third_party/fmt': checked out 'e69e5f977d458f2650bb346dadf2ad30c5320281' 2024-01-31T22:08:44.1276625Z Submodule path 'third_party/foxi': checked out 'c278588e34e535f0bb8f00df3880d26928038cad' 2024-01-31T22:08:44.1587601Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2024-01-31T22:08:44.1780638Z Submodule path 'third_party/gloo': checked out '5354032ea08eadd7fc4456477f7f7c6308818509' 2024-01-31T22:08:44.2163819Z Submodule path 'third_party/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2024-01-31T22:08:44.2251379Z Submodule path 'third_party/ideep': checked out 'e212bbbf81e2a29a82651ae55dc1effa8830f31a' 2024-01-31T22:08:44.2262804Z Submodule 'mkl-dnn' (https://github.com/intel/mkl-dnn.git) registered for path 'third_party/ideep/mkl-dnn' 2024-01-31T22:08:44.2279702Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep/mkl-dnn'... 2024-01-31T22:08:58.1010152Z Submodule path 'third_party/ideep/mkl-dnn': checked out '2dc95a2ad0841e29db8b22fbccaf3e5da7992b01' 2024-01-31T22:08:58.1079886Z Submodule path 'third_party/ios-cmake': checked out '8abaed637d56f1337d6e1d2c4026e25c1eade724' 2024-01-31T22:08:58.1217160Z Submodule path 'third_party/ittapi': checked out '5b8a7d7422611c3a0d799fb5fc5dd4abfae35b42' 2024-01-31T22:08:58.2022057Z Submodule path 'third_party/kineto': checked out 'eb56921f8cf6f4d14c16c45fe7586d5a6dccaf9e' 2024-01-31T22:08:58.2032480Z Submodule 'libkineto/third_party/dynolog' (https://github.com/facebookincubator/dynolog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog' 2024-01-31T22:08:58.2034178Z Submodule 'libkineto/third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/fmt' 2024-01-31T22:08:58.2035870Z Submodule 'libkineto/third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/googletest' 2024-01-31T22:08:58.2051316Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog'... 2024-01-31T22:08:58.7850145Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/fmt'... 2024-01-31T22:09:00.1030897Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/googletest'... 2024-01-31T22:09:01.3315734Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e' 2024-01-31T22:09:01.3326332Z Submodule 'third_party/DCGM' (https://github.com/NVIDIA/DCGM.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-01-31T22:09:01.3328288Z Submodule 'third_party/cpr' (https://github.com/libcpr/cpr.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-01-31T22:09:01.3329968Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-01-31T22:09:01.3331750Z Submodule 'third_party/gflags' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-01-31T22:09:01.3333824Z Submodule 'third_party/glog' (https://github.com/google/glog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-01-31T22:09:01.3336181Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-01-31T22:09:01.3338410Z Submodule 'third_party/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-01-31T22:09:01.3340414Z Submodule 'third_party/pfs' (https://github.com/dtrugman/pfs.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-01-31T22:09:01.3359054Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'... 2024-01-31T22:09:02.4350003Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'... 2024-01-31T22:09:02.8132367Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'... 2024-01-31T22:09:04.1101620Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'... 2024-01-31T22:09:04.4047416Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/glog'... 2024-01-31T22:09:04.8970438Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'... 2024-01-31T22:09:06.0469617Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/json'... 2024-01-31T22:09:12.0509440Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'... 2024-01-31T22:09:12.4447104Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2024-01-31T22:09:12.4579812Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2024-01-31T22:09:12.4875738Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2024-01-31T22:09:12.4973709Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2024-01-31T22:09:12.4983808Z Submodule 'doc' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-01-31T22:09:12.4999648Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'... 2024-01-31T22:09:12.8025104Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2024-01-31T22:09:12.8164607Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2024-01-31T22:09:12.8502783Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850' 2024-01-31T22:09:12.9309771Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2024-01-31T22:09:12.9439033Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2024-01-31T22:09:12.9729022Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out 'a33701196adfad74917046096bf5a2aa0ab0bb50' 2024-01-31T22:09:13.0209199Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347' 2024-01-31T22:09:13.0503826Z Submodule path 'third_party/mimalloc': checked out 'b66e3214d8a104669c2ec05ae91ebc26a8f5ab78' 2024-01-31T22:09:13.0687537Z Submodule path 'third_party/nccl/nccl': checked out '8c6c5951854a57ba90c4424fa040497f6defac46' 2024-01-31T22:09:13.0787092Z Submodule path 'third_party/neon2sse': checked out '97a126f08ce318023be604d03f88bf0820a9464a' 2024-01-31T22:09:13.1643202Z Submodule path 'third_party/nlohmann': checked out '87cda1d6646592ac5866dc703c8e1839046a6806' 2024-01-31T22:09:13.4367107Z Submodule path 'third_party/onnx': checked out 'ccde5da81388ffa770ca98b64e07f803ad089414' 2024-01-31T22:09:13.4393836Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/onnx/third_party/benchmark' 2024-01-31T22:09:13.4395738Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/onnx/third_party/pybind11' 2024-01-31T22:09:13.4411473Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/benchmark'... 2024-01-31T22:09:13.8435573Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/pybind11'... 2024-01-31T22:09:14.7120091Z Submodule path 'third_party/onnx/third_party/benchmark': checked out '2dd015dfef425c866d9a43f2c67d8b52d709acb6' 2024-01-31T22:09:14.7390974Z Submodule path 'third_party/onnx/third_party/pybind11': checked out '5b0a6fc2017fcc176545afe3e09c9f9885283242' 2024-01-31T22:09:14.7528853Z Submodule path 'third_party/onnx-tensorrt': checked out 'c153211418a7c57ce071d9ce2a41f8d1c85a878f' 2024-01-31T22:09:14.7538452Z Submodule 'third_party/onnx' (https://github.com/onnx/onnx.git) registered for path 'third_party/onnx-tensorrt/third_party/onnx' 2024-01-31T22:09:14.7553435Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx-tensorrt/third_party/onnx'... 2024-01-31T22:09:17.2609582Z Submodule path 'third_party/onnx-tensorrt/third_party/onnx': checked out '765f5ee823a67a866f4bd28a9860e81f3c811ce8' 2024-01-31T22:09:17.2624079Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-01-31T22:09:17.2626068Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-01-31T22:09:17.2644180Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark'... 2024-01-31T22:09:17.6523563Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11'... 2024-01-31T22:09:18.4777532Z Submodule path 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark': checked out 'e776aa0275e293707b6a0901e0e8d8a8a3679508' 2024-01-31T22:09:18.5418485Z Submodule path 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11': checked out 'a1041190c8b8ff0cd9e2f0752248ad5e3789ea0c' 2024-01-31T22:09:18.5430364Z Submodule 'tools/clang' (https://github.com/wjakob/clang-cindex-python3) registered for path 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-01-31T22:09:18.5446117Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang'... 2024-01-31T22:09:18.7828275Z Submodule path 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2024-01-31T22:09:18.7894299Z Submodule path 'third_party/pocketfft': checked out '9d3ab05a7fffbc71a492bc6a17be034e83e8f0fe' 2024-01-31T22:09:19.0020205Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2024-01-31T22:09:19.0032710Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/protobuf/third_party/benchmark' 2024-01-31T22:09:19.0034775Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/protobuf/third_party/googletest' 2024-01-31T22:09:19.0050578Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/benchmark'... 2024-01-31T22:09:19.3700456Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/googletest'... 2024-01-31T22:09:20.4363892Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2024-01-31T22:09:20.4975179Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2024-01-31T22:09:20.5033274Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2024-01-31T22:09:20.5120208Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2024-01-31T22:09:20.5401902Z Submodule path 'third_party/pybind11': checked out '8a099e44b3d5f85b20f05828d919d2332a8de841' 2024-01-31T22:09:20.5624206Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2024-01-31T22:09:20.5974948Z Submodule path 'third_party/sleef': checked out 'e0a003ee838b75d11763aa9c3ef17bf71a725bff' 2024-01-31T22:09:20.6875027Z Submodule path 'third_party/tbb': checked out 'a51a90bc609bb73db8ea13841b5cf7aa4344d4a9' 2024-01-31T22:09:20.7075646Z Submodule path 'third_party/tensorpipe': checked out '52791a2fd214b2a9dc5759d36725909c1daa7f2e' 2024-01-31T22:09:20.7088480Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/tensorpipe/third_party/googletest' 2024-01-31T22:09:20.7090491Z Submodule 'third_party/libnop' (https://github.com/google/libnop.git) registered for path 'third_party/tensorpipe/third_party/libnop' 2024-01-31T22:09:20.7092458Z Submodule 'third_party/libuv' (https://github.com/libuv/libuv.git) registered for path 'third_party/tensorpipe/third_party/libuv' 2024-01-31T22:09:20.7094134Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/tensorpipe/third_party/pybind11' 2024-01-31T22:09:20.7110267Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/googletest'... 2024-01-31T22:09:21.7705046Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libnop'... 2024-01-31T22:09:22.0002821Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libuv'... 2024-01-31T22:09:23.0592423Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11'... 2024-01-31T22:09:24.0626820Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2024-01-31T22:09:24.0745557Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2024-01-31T22:09:24.1301371Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '1dff88e5161cba5c59276d2070d2e304e4dcb242' 2024-01-31T22:09:24.1525205Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2024-01-31T22:09:24.1535911Z Submodule 'tools/clang' (https://github.com/wjakob/clang-cindex-python3) registered for path 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-01-31T22:09:24.1553856Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11/tools/clang'... 2024-01-31T22:09:24.3395981Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2024-01-31T22:09:24.4709332Z Submodule path 'third_party/zstd': checked out 'aec56a52fbab207fc639a1937d1e708a282edca8' 2024-01-31T22:09:24.4733905Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2024-01-31T22:09:24.4940067Z Entering 'android/libs/fbjni' 2024-01-31T22:09:24.4969035Z Entering 'third_party/FP16' 2024-01-31T22:09:24.4997334Z Entering 'third_party/FXdiv' 2024-01-31T22:09:24.5026189Z Entering 'third_party/NNPACK' 2024-01-31T22:09:24.5055273Z Entering 'third_party/QNNPACK' 2024-01-31T22:09:24.5083503Z Entering 'third_party/VulkanMemoryAllocator' 2024-01-31T22:09:24.5111155Z Entering 'third_party/XNNPACK' 2024-01-31T22:09:24.5151936Z Entering 'third_party/benchmark' 2024-01-31T22:09:24.5181419Z Entering 'third_party/cpuinfo' 2024-01-31T22:09:24.5210656Z Entering 'third_party/cub' 2024-01-31T22:09:24.5239187Z Entering 'third_party/cudnn_frontend' 2024-01-31T22:09:24.5268085Z Entering 'third_party/cutlass' 2024-01-31T22:09:24.5301636Z Entering 'third_party/eigen' 2024-01-31T22:09:24.5331607Z Entering 'third_party/fbgemm' 2024-01-31T22:09:24.5359062Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-01-31T22:09:24.5388337Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-01-31T22:09:24.5414103Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-01-31T22:09:24.5446675Z Entering 'third_party/fbgemm/third_party/googletest' 2024-01-31T22:09:24.5474888Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-01-31T22:09:24.5502913Z Entering 'third_party/flatbuffers' 2024-01-31T22:09:24.5533482Z Entering 'third_party/fmt' 2024-01-31T22:09:24.5561978Z Entering 'third_party/foxi' 2024-01-31T22:09:24.5591150Z Entering 'third_party/gemmlowp/gemmlowp' 2024-01-31T22:09:24.5619690Z Entering 'third_party/gloo' 2024-01-31T22:09:24.5647889Z Entering 'third_party/googletest' 2024-01-31T22:09:24.5676794Z Entering 'third_party/ideep' 2024-01-31T22:09:24.5702669Z Entering 'third_party/ideep/mkl-dnn' 2024-01-31T22:09:24.5737804Z Entering 'third_party/ios-cmake' 2024-01-31T22:09:24.5765269Z Entering 'third_party/ittapi' 2024-01-31T22:09:24.5794180Z Entering 'third_party/kineto' 2024-01-31T22:09:24.5821157Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-01-31T22:09:24.5850119Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-01-31T22:09:24.5878721Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-01-31T22:09:24.5906007Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-01-31T22:09:24.5933563Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-01-31T22:09:24.5960479Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-01-31T22:09:24.5990587Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-01-31T22:09:24.6017707Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-01-31T22:09:24.6043428Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-01-31T22:09:24.6072680Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-01-31T22:09:24.6099945Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-01-31T22:09:24.6126951Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-01-31T22:09:24.6155958Z Entering 'third_party/mimalloc' 2024-01-31T22:09:24.6183875Z Entering 'third_party/nccl/nccl' 2024-01-31T22:09:24.6209842Z Entering 'third_party/neon2sse' 2024-01-31T22:09:24.6238507Z Entering 'third_party/nlohmann' 2024-01-31T22:09:24.6269019Z Entering 'third_party/onnx' 2024-01-31T22:09:24.6308888Z Entering 'third_party/onnx/third_party/benchmark' 2024-01-31T22:09:24.6337100Z Entering 'third_party/onnx/third_party/pybind11' 2024-01-31T22:09:24.6366453Z Entering 'third_party/onnx-tensorrt' 2024-01-31T22:09:24.6394387Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-01-31T22:09:24.6426068Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-01-31T22:09:24.6451368Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-01-31T22:09:24.6478167Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-01-31T22:09:24.6510973Z Entering 'third_party/pocketfft' 2024-01-31T22:09:24.6539732Z Entering 'third_party/protobuf' 2024-01-31T22:09:24.6569891Z Entering 'third_party/protobuf/third_party/benchmark' 2024-01-31T22:09:24.6596424Z Entering 'third_party/protobuf/third_party/googletest' 2024-01-31T22:09:24.6623196Z Entering 'third_party/psimd' 2024-01-31T22:09:24.6651599Z Entering 'third_party/pthreadpool' 2024-01-31T22:09:24.6677309Z Entering 'third_party/pybind11' 2024-01-31T22:09:24.6705516Z Entering 'third_party/python-peachpy' 2024-01-31T22:09:24.6730084Z Entering 'third_party/sleef' 2024-01-31T22:09:24.6758338Z Entering 'third_party/tbb' 2024-01-31T22:09:24.6786281Z Entering 'third_party/tensorpipe' 2024-01-31T22:09:24.6814736Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-01-31T22:09:24.6842276Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-01-31T22:09:24.6870284Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-01-31T22:09:24.6896218Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-01-31T22:09:24.6921733Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-01-31T22:09:24.6953603Z Entering 'third_party/zstd' 2024-01-31T22:09:24.6991599Z ##[endgroup] 2024-01-31T22:09:24.6992233Z ##[group]Persisting credentials for submodules 2024-01-31T22:09:24.6995075Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2024-01-31T22:09:24.7202860Z Entering 'android/libs/fbjni' 2024-01-31T22:09:24.7237161Z Entering 'third_party/FP16' 2024-01-31T22:09:24.7271937Z Entering 'third_party/FXdiv' 2024-01-31T22:09:24.7308210Z Entering 'third_party/NNPACK' 2024-01-31T22:09:24.7341519Z Entering 'third_party/QNNPACK' 2024-01-31T22:09:24.7376298Z Entering 'third_party/VulkanMemoryAllocator' 2024-01-31T22:09:24.7410995Z Entering 'third_party/XNNPACK' 2024-01-31T22:09:24.7456381Z Entering 'third_party/benchmark' 2024-01-31T22:09:24.7490674Z Entering 'third_party/cpuinfo' 2024-01-31T22:09:24.7525238Z Entering 'third_party/cub' 2024-01-31T22:09:24.7558817Z Entering 'third_party/cudnn_frontend' 2024-01-31T22:09:24.7591790Z Entering 'third_party/cutlass' 2024-01-31T22:09:24.7628810Z Entering 'third_party/eigen' 2024-01-31T22:09:24.7666540Z Entering 'third_party/fbgemm' 2024-01-31T22:09:24.7699731Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-01-31T22:09:24.7734311Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-01-31T22:09:24.7769657Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-01-31T22:09:24.7809313Z Entering 'third_party/fbgemm/third_party/googletest' 2024-01-31T22:09:24.7842816Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-01-31T22:09:24.7876613Z Entering 'third_party/flatbuffers' 2024-01-31T22:09:24.7915453Z Entering 'third_party/fmt' 2024-01-31T22:09:24.7951153Z Entering 'third_party/foxi' 2024-01-31T22:09:24.7984072Z Entering 'third_party/gemmlowp/gemmlowp' 2024-01-31T22:09:24.8019842Z Entering 'third_party/gloo' 2024-01-31T22:09:24.8055068Z Entering 'third_party/googletest' 2024-01-31T22:09:24.8089922Z Entering 'third_party/ideep' 2024-01-31T22:09:24.8121471Z Entering 'third_party/ideep/mkl-dnn' 2024-01-31T22:09:24.8162986Z Entering 'third_party/ios-cmake' 2024-01-31T22:09:24.8199064Z Entering 'third_party/ittapi' 2024-01-31T22:09:24.8233670Z Entering 'third_party/kineto' 2024-01-31T22:09:24.8269643Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-01-31T22:09:24.8302276Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-01-31T22:09:24.8338885Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-01-31T22:09:24.8374935Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-01-31T22:09:24.8411106Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-01-31T22:09:24.8444045Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-01-31T22:09:24.8479956Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-01-31T22:09:24.8515770Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-01-31T22:09:24.8552390Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-01-31T22:09:24.8587095Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-01-31T22:09:24.8622961Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-01-31T22:09:24.8657803Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-01-31T22:09:24.8693836Z Entering 'third_party/mimalloc' 2024-01-31T22:09:24.8727710Z Entering 'third_party/nccl/nccl' 2024-01-31T22:09:24.8760085Z Entering 'third_party/neon2sse' 2024-01-31T22:09:24.8793924Z Entering 'third_party/nlohmann' 2024-01-31T22:09:24.8830674Z Entering 'third_party/onnx' 2024-01-31T22:09:24.8874003Z Entering 'third_party/onnx/third_party/benchmark' 2024-01-31T22:09:24.8909435Z Entering 'third_party/onnx/third_party/pybind11' 2024-01-31T22:09:24.8944605Z Entering 'third_party/onnx-tensorrt' 2024-01-31T22:09:24.8978202Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-01-31T22:09:24.9015918Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-01-31T22:09:24.9050861Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-01-31T22:09:24.9083316Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-01-31T22:09:24.9122587Z Entering 'third_party/pocketfft' 2024-01-31T22:09:24.9158112Z Entering 'third_party/protobuf' 2024-01-31T22:09:24.9196706Z Entering 'third_party/protobuf/third_party/benchmark' 2024-01-31T22:09:24.9232897Z Entering 'third_party/protobuf/third_party/googletest' 2024-01-31T22:09:24.9266931Z Entering 'third_party/psimd' 2024-01-31T22:09:24.9303539Z Entering 'third_party/pthreadpool' 2024-01-31T22:09:24.9336775Z Entering 'third_party/pybind11' 2024-01-31T22:09:24.9371017Z Entering 'third_party/python-peachpy' 2024-01-31T22:09:24.9403705Z Entering 'third_party/sleef' 2024-01-31T22:09:24.9438605Z Entering 'third_party/tbb' 2024-01-31T22:09:24.9476246Z Entering 'third_party/tensorpipe' 2024-01-31T22:09:24.9512134Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-01-31T22:09:24.9545254Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-01-31T22:09:24.9579681Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-01-31T22:09:24.9613139Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-01-31T22:09:24.9646090Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-01-31T22:09:24.9682652Z Entering 'third_party/zstd' 2024-01-31T22:09:24.9730732Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2024-01-31T22:09:24.9933846Z Entering 'android/libs/fbjni' 2024-01-31T22:09:24.9965013Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2024-01-31T22:09:24.9978932Z Entering 'third_party/FP16' 2024-01-31T22:09:25.0011399Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2024-01-31T22:09:25.0023357Z Entering 'third_party/FXdiv' 2024-01-31T22:09:25.0054485Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2024-01-31T22:09:25.0067566Z Entering 'third_party/NNPACK' 2024-01-31T22:09:25.0098460Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2024-01-31T22:09:25.0112374Z Entering 'third_party/QNNPACK' 2024-01-31T22:09:25.0143274Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/QNNPACK/config remote.origin.url 2024-01-31T22:09:25.0157534Z Entering 'third_party/VulkanMemoryAllocator' 2024-01-31T22:09:25.0190412Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2024-01-31T22:09:25.0201494Z Entering 'third_party/XNNPACK' 2024-01-31T22:09:25.0231784Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2024-01-31T22:09:25.0256142Z Entering 'third_party/benchmark' 2024-01-31T22:09:25.0287687Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2024-01-31T22:09:25.0301205Z Entering 'third_party/cpuinfo' 2024-01-31T22:09:25.0334028Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2024-01-31T22:09:25.0346769Z Entering 'third_party/cub' 2024-01-31T22:09:25.0376709Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cub/config remote.origin.url 2024-01-31T22:09:25.0389566Z Entering 'third_party/cudnn_frontend' 2024-01-31T22:09:25.0420820Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2024-01-31T22:09:25.0433716Z Entering 'third_party/cutlass' 2024-01-31T22:09:25.0464346Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2024-01-31T22:09:25.0481127Z Entering 'third_party/eigen' 2024-01-31T22:09:25.0512739Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/eigen/config remote.origin.url 2024-01-31T22:09:25.0526894Z Entering 'third_party/fbgemm' 2024-01-31T22:09:25.0558624Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2024-01-31T22:09:25.0571897Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-01-31T22:09:25.0600563Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/asmjit/config remote.origin.url 2024-01-31T22:09:25.0611847Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-01-31T22:09:25.0643340Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/cpuinfo/config remote.origin.url 2024-01-31T22:09:25.0657548Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-01-31T22:09:25.0688851Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/cutlass/config remote.origin.url 2024-01-31T22:09:25.0706786Z Entering 'third_party/fbgemm/third_party/googletest' 2024-01-31T22:09:25.0738417Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/googletest/config remote.origin.url 2024-01-31T22:09:25.0750791Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-01-31T22:09:25.0783718Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/hipify_torch/config remote.origin.url 2024-01-31T22:09:25.0797930Z Entering 'third_party/flatbuffers' 2024-01-31T22:09:25.0829254Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2024-01-31T22:09:25.0845442Z Entering 'third_party/fmt' 2024-01-31T22:09:25.0876096Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2024-01-31T22:09:25.0889860Z Entering 'third_party/foxi' 2024-01-31T22:09:25.0921338Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/foxi/config remote.origin.url 2024-01-31T22:09:25.0933080Z Entering 'third_party/gemmlowp/gemmlowp' 2024-01-31T22:09:25.0963581Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2024-01-31T22:09:25.0975042Z Entering 'third_party/gloo' 2024-01-31T22:09:25.1004369Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2024-01-31T22:09:25.1016666Z Entering 'third_party/googletest' 2024-01-31T22:09:25.1048247Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2024-01-31T22:09:25.1061854Z Entering 'third_party/ideep' 2024-01-31T22:09:25.1092366Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2024-01-31T22:09:25.1105263Z Entering 'third_party/ideep/mkl-dnn' 2024-01-31T22:09:25.1136376Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2024-01-31T22:09:25.1155238Z Entering 'third_party/ios-cmake' 2024-01-31T22:09:25.1187403Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ios-cmake/config remote.origin.url 2024-01-31T22:09:25.1198694Z Entering 'third_party/ittapi' 2024-01-31T22:09:25.1231223Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2024-01-31T22:09:25.1246135Z Entering 'third_party/kineto' 2024-01-31T22:09:25.1278951Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2024-01-31T22:09:25.1292224Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-01-31T22:09:25.1323856Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2024-01-31T22:09:25.1336731Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-01-31T22:09:25.1370645Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2024-01-31T22:09:25.1383770Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-01-31T22:09:25.1417486Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2024-01-31T22:09:25.1430272Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-01-31T22:09:25.1462580Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2024-01-31T22:09:25.1476086Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-01-31T22:09:25.1509122Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2024-01-31T22:09:25.1521119Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-01-31T22:09:25.1553235Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2024-01-31T22:09:25.1567314Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-01-31T22:09:25.1597943Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2024-01-31T22:09:25.1610630Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-01-31T22:09:25.1642353Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2024-01-31T22:09:25.1655745Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-01-31T22:09:25.1687795Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2024-01-31T22:09:25.1700555Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-01-31T22:09:25.1732205Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2024-01-31T22:09:25.1745829Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-01-31T22:09:25.1778287Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2024-01-31T22:09:25.1789846Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-01-31T22:09:25.1821655Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2024-01-31T22:09:25.1835321Z Entering 'third_party/mimalloc' 2024-01-31T22:09:25.1867261Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2024-01-31T22:09:25.1879734Z Entering 'third_party/nccl/nccl' 2024-01-31T22:09:25.1911916Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nccl/nccl/config remote.origin.url 2024-01-31T22:09:25.1924211Z Entering 'third_party/neon2sse' 2024-01-31T22:09:25.1954883Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/neon2sse/config remote.origin.url 2024-01-31T22:09:25.1968212Z Entering 'third_party/nlohmann' 2024-01-31T22:09:25.2000894Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2024-01-31T22:09:25.2014838Z Entering 'third_party/onnx' 2024-01-31T22:09:25.2044374Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2024-01-31T22:09:25.2069433Z Entering 'third_party/onnx/third_party/benchmark' 2024-01-31T22:09:25.2102692Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/benchmark/config remote.origin.url 2024-01-31T22:09:25.2115150Z Entering 'third_party/onnx/third_party/pybind11' 2024-01-31T22:09:25.2148143Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2024-01-31T22:09:25.2162397Z Entering 'third_party/onnx-tensorrt' 2024-01-31T22:09:25.2195091Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx-tensorrt/config remote.origin.url 2024-01-31T22:09:25.2208398Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-01-31T22:09:25.2238927Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx-tensorrt/modules/third_party/onnx/config remote.origin.url 2024-01-31T22:09:25.2253809Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-01-31T22:09:25.2286374Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx-tensorrt/modules/third_party/onnx/modules/third_party/benchmark/config remote.origin.url 2024-01-31T22:09:25.2299056Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-01-31T22:09:25.2329569Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx-tensorrt/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2024-01-31T22:09:25.2341126Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-01-31T22:09:25.2372609Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx-tensorrt/modules/third_party/onnx/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2024-01-31T22:09:25.2389402Z Entering 'third_party/pocketfft' 2024-01-31T22:09:25.2419761Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2024-01-31T22:09:25.2430429Z Entering 'third_party/protobuf' 2024-01-31T22:09:25.2462460Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2024-01-31T22:09:25.2477279Z Entering 'third_party/protobuf/third_party/benchmark' 2024-01-31T22:09:25.2508446Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2024-01-31T22:09:25.2521678Z Entering 'third_party/protobuf/third_party/googletest' 2024-01-31T22:09:25.2552795Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2024-01-31T22:09:25.2567659Z Entering 'third_party/psimd' 2024-01-31T22:09:25.2598123Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2024-01-31T22:09:25.2610954Z Entering 'third_party/pthreadpool' 2024-01-31T22:09:25.2641177Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2024-01-31T22:09:25.2655225Z Entering 'third_party/pybind11' 2024-01-31T22:09:25.2686851Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2024-01-31T22:09:25.2700196Z Entering 'third_party/python-peachpy' 2024-01-31T22:09:25.2730833Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2024-01-31T22:09:25.2743343Z Entering 'third_party/sleef' 2024-01-31T22:09:25.2772660Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2024-01-31T22:09:25.2785285Z Entering 'third_party/tbb' 2024-01-31T22:09:25.2816251Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tbb/config remote.origin.url 2024-01-31T22:09:25.2830192Z Entering 'third_party/tensorpipe' 2024-01-31T22:09:25.2861585Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2024-01-31T22:09:25.2874009Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-01-31T22:09:25.2906208Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2024-01-31T22:09:25.2918984Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-01-31T22:09:25.2952142Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2024-01-31T22:09:25.2965459Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-01-31T22:09:25.2997419Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2024-01-31T22:09:25.3010823Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-01-31T22:09:25.3041041Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2024-01-31T22:09:25.3053440Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-01-31T22:09:25.3085543Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2024-01-31T22:09:25.3101384Z Entering 'third_party/zstd' 2024-01-31T22:09:25.3132587Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/zstd/config remote.origin.url 2024-01-31T22:09:25.4011936Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2024-01-31T22:09:25.4217137Z Entering 'android/libs/fbjni' 2024-01-31T22:09:25.4244411Z Entering 'third_party/FP16' 2024-01-31T22:09:25.4273632Z Entering 'third_party/FXdiv' 2024-01-31T22:09:25.4301607Z Entering 'third_party/NNPACK' 2024-01-31T22:09:25.4328980Z Entering 'third_party/QNNPACK' 2024-01-31T22:09:25.4357425Z Entering 'third_party/VulkanMemoryAllocator' 2024-01-31T22:09:25.4383287Z Entering 'third_party/XNNPACK' 2024-01-31T22:09:25.4424964Z Entering 'third_party/benchmark' 2024-01-31T22:09:25.4452839Z Entering 'third_party/cpuinfo' 2024-01-31T22:09:25.4481471Z Entering 'third_party/cub' 2024-01-31T22:09:25.4508738Z Entering 'third_party/cudnn_frontend' 2024-01-31T22:09:25.4535635Z Entering 'third_party/cutlass' 2024-01-31T22:09:25.4569507Z Entering 'third_party/eigen' 2024-01-31T22:09:25.4599268Z Entering 'third_party/fbgemm' 2024-01-31T22:09:25.4627574Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-01-31T22:09:25.4654550Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-01-31T22:09:25.4683059Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-01-31T22:09:25.4715367Z Entering 'third_party/fbgemm/third_party/googletest' 2024-01-31T22:09:25.4741784Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-01-31T22:09:25.4770539Z Entering 'third_party/flatbuffers' 2024-01-31T22:09:25.4800982Z Entering 'third_party/fmt' 2024-01-31T22:09:25.4829303Z Entering 'third_party/foxi' 2024-01-31T22:09:25.4856150Z Entering 'third_party/gemmlowp/gemmlowp' 2024-01-31T22:09:25.4885645Z Entering 'third_party/gloo' 2024-01-31T22:09:25.4914724Z Entering 'third_party/googletest' 2024-01-31T22:09:25.4941536Z Entering 'third_party/ideep' 2024-01-31T22:09:25.4966513Z Entering 'third_party/ideep/mkl-dnn' 2024-01-31T22:09:25.4998559Z Entering 'third_party/ios-cmake' 2024-01-31T22:09:25.5027667Z Entering 'third_party/ittapi' 2024-01-31T22:09:25.5054399Z Entering 'third_party/kineto' 2024-01-31T22:09:25.5083342Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-01-31T22:09:25.5112390Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-01-31T22:09:25.5140418Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-01-31T22:09:25.5168543Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-01-31T22:09:25.5196030Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-01-31T22:09:25.5224813Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-01-31T22:09:25.5252003Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-01-31T22:09:25.5279547Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-01-31T22:09:25.5309715Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-01-31T22:09:25.5339772Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-01-31T22:09:25.5370789Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-01-31T22:09:25.5397355Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-01-31T22:09:25.5425397Z Entering 'third_party/mimalloc' 2024-01-31T22:09:25.5453708Z Entering 'third_party/nccl/nccl' 2024-01-31T22:09:25.5480028Z Entering 'third_party/neon2sse' 2024-01-31T22:09:25.5509883Z Entering 'third_party/nlohmann' 2024-01-31T22:09:25.5537056Z Entering 'third_party/onnx' 2024-01-31T22:09:25.5576171Z Entering 'third_party/onnx/third_party/benchmark' 2024-01-31T22:09:25.5606011Z Entering 'third_party/onnx/third_party/pybind11' 2024-01-31T22:09:25.5635384Z Entering 'third_party/onnx-tensorrt' 2024-01-31T22:09:25.5664224Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-01-31T22:09:25.5693587Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-01-31T22:09:25.5721492Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-01-31T22:09:25.5750089Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-01-31T22:09:25.5781350Z Entering 'third_party/pocketfft' 2024-01-31T22:09:25.5811097Z Entering 'third_party/protobuf' 2024-01-31T22:09:25.5842710Z Entering 'third_party/protobuf/third_party/benchmark' 2024-01-31T22:09:25.5871574Z Entering 'third_party/protobuf/third_party/googletest' 2024-01-31T22:09:25.5898380Z Entering 'third_party/psimd' 2024-01-31T22:09:25.5928100Z Entering 'third_party/pthreadpool' 2024-01-31T22:09:25.5957461Z Entering 'third_party/pybind11' 2024-01-31T22:09:25.5984865Z Entering 'third_party/python-peachpy' 2024-01-31T22:09:25.6013060Z Entering 'third_party/sleef' 2024-01-31T22:09:25.6040675Z Entering 'third_party/tbb' 2024-01-31T22:09:25.6067747Z Entering 'third_party/tensorpipe' 2024-01-31T22:09:25.6094493Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-01-31T22:09:25.6121726Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-01-31T22:09:25.6146465Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-01-31T22:09:25.6171478Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-01-31T22:09:25.6197215Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-01-31T22:09:25.6229202Z Entering 'third_party/zstd' 2024-01-31T22:09:25.6267982Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2024-01-31T22:09:25.6472503Z Entering 'android/libs/fbjni' 2024-01-31T22:09:25.6498214Z Entering 'third_party/FP16' 2024-01-31T22:09:25.6525124Z Entering 'third_party/FXdiv' 2024-01-31T22:09:25.6553823Z Entering 'third_party/NNPACK' 2024-01-31T22:09:25.6583888Z Entering 'third_party/QNNPACK' 2024-01-31T22:09:25.6610334Z Entering 'third_party/VulkanMemoryAllocator' 2024-01-31T22:09:25.6639515Z Entering 'third_party/XNNPACK' 2024-01-31T22:09:25.6676157Z Entering 'third_party/benchmark' 2024-01-31T22:09:25.6705287Z Entering 'third_party/cpuinfo' 2024-01-31T22:09:25.6731276Z Entering 'third_party/cub' 2024-01-31T22:09:25.6761922Z Entering 'third_party/cudnn_frontend' 2024-01-31T22:09:25.6790830Z Entering 'third_party/cutlass' 2024-01-31T22:09:25.6822829Z Entering 'third_party/eigen' 2024-01-31T22:09:25.6852601Z Entering 'third_party/fbgemm' 2024-01-31T22:09:25.6878180Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-01-31T22:09:25.6905923Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-01-31T22:09:25.6931880Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-01-31T22:09:25.6962813Z Entering 'third_party/fbgemm/third_party/googletest' 2024-01-31T22:09:25.6991621Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-01-31T22:09:25.7019577Z Entering 'third_party/flatbuffers' 2024-01-31T22:09:25.7046880Z Entering 'third_party/fmt' 2024-01-31T22:09:25.7076038Z Entering 'third_party/foxi' 2024-01-31T22:09:25.7104363Z Entering 'third_party/gemmlowp/gemmlowp' 2024-01-31T22:09:25.7132994Z Entering 'third_party/gloo' 2024-01-31T22:09:25.7161684Z Entering 'third_party/googletest' 2024-01-31T22:09:25.7191056Z Entering 'third_party/ideep' 2024-01-31T22:09:25.7217100Z Entering 'third_party/ideep/mkl-dnn' 2024-01-31T22:09:25.7249637Z Entering 'third_party/ios-cmake' 2024-01-31T22:09:25.7277567Z Entering 'third_party/ittapi' 2024-01-31T22:09:25.7307703Z Entering 'third_party/kineto' 2024-01-31T22:09:25.7336190Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-01-31T22:09:25.7363757Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-01-31T22:09:25.7392813Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-01-31T22:09:25.7421456Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-01-31T22:09:25.7449341Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-01-31T22:09:25.7474341Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-01-31T22:09:25.7503005Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-01-31T22:09:25.7531794Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-01-31T22:09:25.7557658Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-01-31T22:09:25.7587724Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-01-31T22:09:25.7615974Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-01-31T22:09:25.7642164Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-01-31T22:09:25.7672152Z Entering 'third_party/mimalloc' 2024-01-31T22:09:25.7699749Z Entering 'third_party/nccl/nccl' 2024-01-31T22:09:25.7728583Z Entering 'third_party/neon2sse' 2024-01-31T22:09:25.7757656Z Entering 'third_party/nlohmann' 2024-01-31T22:09:25.7785896Z Entering 'third_party/onnx' 2024-01-31T22:09:25.7826807Z Entering 'third_party/onnx/third_party/benchmark' 2024-01-31T22:09:25.7855082Z Entering 'third_party/onnx/third_party/pybind11' 2024-01-31T22:09:25.7885123Z Entering 'third_party/onnx-tensorrt' 2024-01-31T22:09:25.7914770Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-01-31T22:09:25.7947239Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-01-31T22:09:25.7973638Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-01-31T22:09:25.7999771Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-01-31T22:09:25.8032134Z Entering 'third_party/pocketfft' 2024-01-31T22:09:25.8061334Z Entering 'third_party/protobuf' 2024-01-31T22:09:25.8089199Z Entering 'third_party/protobuf/third_party/benchmark' 2024-01-31T22:09:25.8116045Z Entering 'third_party/protobuf/third_party/googletest' 2024-01-31T22:09:25.8146456Z Entering 'third_party/psimd' 2024-01-31T22:09:25.8174654Z Entering 'third_party/pthreadpool' 2024-01-31T22:09:25.8201445Z Entering 'third_party/pybind11' 2024-01-31T22:09:25.8230369Z Entering 'third_party/python-peachpy' 2024-01-31T22:09:25.8259491Z Entering 'third_party/sleef' 2024-01-31T22:09:25.8287375Z Entering 'third_party/tbb' 2024-01-31T22:09:25.8316622Z Entering 'third_party/tensorpipe' 2024-01-31T22:09:25.8345555Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-01-31T22:09:25.8371777Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-01-31T22:09:25.8397828Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-01-31T22:09:25.8426931Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-01-31T22:09:25.8452454Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-01-31T22:09:25.8480161Z Entering 'third_party/zstd' 2024-01-31T22:09:25.8517886Z ##[endgroup] 2024-01-31T22:09:25.8549673Z [command]/usr/bin/git log -1 --format='%H' 2024-01-31T22:09:25.8571950Z '86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528' 2024-01-31T22:09:25.8722294Z Prepare all required actions 2024-01-31T22:09:25.8722744Z Getting action download info 2024-01-31T22:09:26.0141910Z ##[group]Run ./.github/actions/setup-linux 2024-01-31T22:09:26.0142337Z env: 2024-01-31T22:09:26.0142618Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:26.0143001Z ##[endgroup] 2024-01-31T22:09:26.0223761Z ##[group]Run set -euo pipefail 2024-01-31T22:09:26.0224194Z set -euo pipefail 2024-01-31T22:09:26.0224592Z function get_ec2_metadata() { 2024-01-31T22:09:26.0225132Z  # Pulled from instance metadata endpoint for EC2 2024-01-31T22:09:26.0226035Z  # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html 2024-01-31T22:09:26.0226820Z  category=$1 2024-01-31T22:09:26.0227357Z  # If it is GCP runner (runner name contains gcp), do not run this 2024-01-31T22:09:26.0227999Z  runner_name_str=i-0f7dd096d568eb180 2024-01-31T22:09:26.0228557Z  if [[ $runner_name_str != *"gcp"* ]]; then 2024-01-31T22:09:26.0229226Z  curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" 2024-01-31T22:09:26.0229813Z  else 2024-01-31T22:09:26.0230339Z  echo "Runner is from Google Cloud Platform, No info on ec2 metadata" 2024-01-31T22:09:26.0230947Z  fi 2024-01-31T22:09:26.0231227Z } 2024-01-31T22:09:26.0231603Z echo "ami-id: $(get_ec2_metadata ami-id)" 2024-01-31T22:09:26.0232202Z echo "instance-id: $(get_ec2_metadata instance-id)" 2024-01-31T22:09:26.0232863Z echo "instance-type: $(get_ec2_metadata instance-type)" 2024-01-31T22:09:26.0233443Z echo "system info $(uname -a)" 2024-01-31T22:09:26.0241614Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:26.0242117Z env: 2024-01-31T22:09:26.0242390Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:26.0242738Z ##[endgroup] 2024-01-31T22:09:26.0311643Z ami-id: ami-0062355a529d6089c 2024-01-31T22:09:26.0357284Z instance-id: i-0f7dd096d568eb180 2024-01-31T22:09:26.0395775Z instance-type: g5.4xlarge 2024-01-31T22:09:26.0402029Z system info Linux ip-10-0-30-139.ec2.internal 4.14.328-248.540.amzn2.x86_64 #1 SMP Tue Nov 7 03:51:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux 2024-01-31T22:09:26.0421174Z ##[group]Run if systemctl is-active --quiet docker; then 2024-01-31T22:09:26.0421824Z if systemctl is-active --quiet docker; then 2024-01-31T22:09:26.0422365Z  echo "Docker daemon is running..."; 2024-01-31T22:09:26.0422818Z else 2024-01-31T22:09:26.0423318Z  echo "Starting docker deamon..." && sudo systemctl start docker; 2024-01-31T22:09:26.0423907Z fi 2024-01-31T22:09:26.0431602Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:26.0432084Z env: 2024-01-31T22:09:26.0432366Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:26.0432715Z ##[endgroup] 2024-01-31T22:09:26.0471113Z Docker daemon is running... 2024-01-31T22:09:26.0516910Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-01-31T22:09:26.0517541Z with: 2024-01-31T22:09:26.0517892Z shell: bash 2024-01-31T22:09:26.0518270Z timeout_minutes: 5 2024-01-31T22:09:26.0518668Z max_attempts: 3 2024-01-31T22:09:26.0519073Z retry_wait_seconds: 30 2024-01-31T22:09:26.0520532Z command: AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" 2024-01-31T22:09:26.0522206Z polling_interval_seconds: 1 2024-01-31T22:09:26.0522660Z warning_on_retry: true 2024-01-31T22:09:26.0523087Z continue_on_error: false 2024-01-31T22:09:26.0523492Z env: 2024-01-31T22:09:26.0523838Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:26.0524484Z AWS_RETRY_MODE: standard 2024-01-31T22:09:26.0525099Z AWS_MAX_ATTEMPTS: 5 2024-01-31T22:09:26.0525554Z AWS_DEFAULT_REGION: us-east-1 2024-01-31T22:09:26.0526185Z ##[endgroup] 2024-01-31T22:09:26.9051662Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2024-01-31T22:09:26.9052536Z Configure a credential helper to remove this warning. See 2024-01-31T22:09:26.9053374Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2024-01-31T22:09:26.9053891Z 2024-01-31T22:09:26.9054015Z Login Succeeded 2024-01-31T22:09:27.1019186Z Command completed after 1 attempt(s). 2024-01-31T22:09:27.1061165Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2024-01-31T22:09:27.1061902Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2024-01-31T22:09:27.1062557Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2024-01-31T22:09:27.1070522Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.1071010Z env: 2024-01-31T22:09:27.1071290Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.1071633Z ##[endgroup] 2024-01-31T22:09:27.1124523Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2024-01-31T22:09:27.1125614Z # ignore expansion of "docker ps -q" since it could be empty 2024-01-31T22:09:27.1126204Z # shellcheck disable=SC2046 2024-01-31T22:09:27.1126655Z docker stop $(docker ps -q) || true 2024-01-31T22:09:27.1127133Z # Prune all of the docker images 2024-01-31T22:09:27.1127579Z docker system prune -af 2024-01-31T22:09:27.1134635Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.1135135Z env: 2024-01-31T22:09:27.1135419Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.1135762Z ##[endgroup] 2024-01-31T22:09:27.1439036Z "docker stop" requires at least 1 argument. 2024-01-31T22:09:27.1439733Z See 'docker stop --help'. 2024-01-31T22:09:27.1439975Z 2024-01-31T22:09:27.1440206Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 2024-01-31T22:09:27.1440588Z 2024-01-31T22:09:27.1441746Z Stop one or more running containers 2024-01-31T22:09:27.1640825Z Total reclaimed space: 0B 2024-01-31T22:09:27.1672466Z ##[group]Run set +e 2024-01-31T22:09:27.1672820Z set +e 2024-01-31T22:09:27.1673116Z set -x 2024-01-31T22:09:27.1673404Z  2024-01-31T22:09:27.1673720Z PT_DOMAIN=download.pytorch.org 2024-01-31T22:09:27.1674508Z # TODO: Flaky access to download.pytorch.org https://github.com/pytorch/pytorch/issues/100400, 2024-01-31T22:09:27.1675574Z # cleaning this up once the issue is fixed. There are more than one resolved IP here, the last 2024-01-31T22:09:27.1676332Z # one is returned at random 2024-01-31T22:09:27.1676865Z RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" | tail -n1) 2024-01-31T22:09:27.1677382Z  2024-01-31T22:09:27.1677697Z if [ -z "${RESOLVED_IP}" ]; then 2024-01-31T22:09:27.1678305Z  echo "Couldn't resolve ${PT_DOMAIN}, retrying with Google DNS..." 2024-01-31T22:09:27.1679085Z  RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" @8.8.8.8 | tail -n1) 2024-01-31T22:09:27.1679638Z  2024-01-31T22:09:27.1679950Z  if [ -z "${RESOLVED_IP}" ]; then 2024-01-31T22:09:27.1680500Z  echo "Couldn't resolve ${PT_DOMAIN}, exiting..." 2024-01-31T22:09:27.1681009Z  exit 1 2024-01-31T22:09:27.1681312Z  fi 2024-01-31T22:09:27.1681597Z fi 2024-01-31T22:09:27.1681878Z  2024-01-31T22:09:27.1682227Z if grep -r "${PT_DOMAIN}" /etc/hosts; then 2024-01-31T22:09:27.1682746Z  # Clean up any old records first 2024-01-31T22:09:27.1683386Z  sudo sed -i "/${PT_DOMAIN}/d" /etc/hosts 2024-01-31T22:09:27.1683821Z fi 2024-01-31T22:09:27.1684096Z  2024-01-31T22:09:27.1684527Z echo "${RESOLVED_IP} ${PT_DOMAIN}" | sudo tee -a /etc/hosts 2024-01-31T22:09:27.1685435Z cat /etc/hosts 2024-01-31T22:09:27.1693325Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.1693945Z env: 2024-01-31T22:09:27.1694235Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.1694577Z ##[endgroup] 2024-01-31T22:09:27.1714609Z + PT_DOMAIN=download.pytorch.org 2024-01-31T22:09:27.1718582Z ++ dig -4 +short download.pytorch.org 2024-01-31T22:09:27.1719359Z ++ tail -n1 2024-01-31T22:09:27.1827775Z + RESOLVED_IP=18.160.10.76 2024-01-31T22:09:27.1828338Z + '[' -z 18.160.10.76 ']' 2024-01-31T22:09:27.1828857Z + grep -r download.pytorch.org /etc/hosts 2024-01-31T22:09:27.1833398Z 18.160.10.36 download.pytorch.org 2024-01-31T22:09:27.1834241Z + sudo sed -i /download.pytorch.org/d /etc/hosts 2024-01-31T22:09:27.1931085Z + echo '18.160.10.76 download.pytorch.org' 2024-01-31T22:09:27.1931892Z + sudo tee -a /etc/hosts 2024-01-31T22:09:27.2201068Z 18.160.10.76 download.pytorch.org 2024-01-31T22:09:27.2214907Z + cat /etc/hosts 2024-01-31T22:09:27.2219292Z 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 2024-01-31T22:09:27.2229985Z ::1 localhost6 localhost6.localdomain6 2024-01-31T22:09:27.2230806Z 18.160.10.76 download.pytorch.org 2024-01-31T22:09:27.2369452Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2024-01-31T22:09:27.2370124Z with: 2024-01-31T22:09:27.2371153Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2372364Z docker-build-dir: .ci/docker 2024-01-31T22:09:27.2372831Z working-directory: . 2024-01-31T22:09:27.2373380Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-01-31T22:09:27.2374012Z force-push: false 2024-01-31T22:09:27.2374398Z env: 2024-01-31T22:09:27.2374748Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.2375174Z ##[endgroup] 2024-01-31T22:09:27.2396045Z ##[group]Run set -ex 2024-01-31T22:09:27.2396486Z set -ex 2024-01-31T22:09:27.2396851Z  2024-01-31T22:09:27.2397488Z # If the docker build directory or the build script doesn't exist, the action will 2024-01-31T22:09:27.2398543Z # gracefully return the docker image name as it is. Pulling docker image in Linux 2024-01-31T22:09:27.2399417Z # job could then download the pre-built image as usual 2024-01-31T22:09:27.2400239Z if [[ ! -d "${DOCKER_BUILD_DIR}" ]] || [[ ! -f "${DOCKER_BUILD_DIR}/build.sh" ]]; then 2024-01-31T22:09:27.2400989Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2401698Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2402339Z  2024-01-31T22:09:27.2402930Z  echo "There is no Docker build script in ${REPO_NAME} repo, skipping..." 2024-01-31T22:09:27.2403628Z  exit 0 2024-01-31T22:09:27.2404009Z else 2024-01-31T22:09:27.2404455Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2405213Z fi 2024-01-31T22:09:27.2405568Z  2024-01-31T22:09:27.2406113Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2024-01-31T22:09:27.2407050Z  # The docker image name already includes the ECR prefix and tag, so we can just 2024-01-31T22:09:27.2407906Z  # use it as it is, but first let's extract the tag 2024-01-31T22:09:27.2408692Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2024-01-31T22:09:27.2409500Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2410282Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2411044Z else 2024-01-31T22:09:27.2411568Z  DOCKER_TAG=$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2024-01-31T22:09:27.2412308Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2413290Z  echo "docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2414116Z fi 2024-01-31T22:09:27.2422144Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.2422755Z env: 2024-01-31T22:09:27.2423105Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.2423553Z REPO_NAME: pytorch 2024-01-31T22:09:27.2424738Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2426005Z DOCKER_BUILD_DIR: .ci/docker 2024-01-31T22:09:27.2426605Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-01-31T22:09:27.2427252Z ##[endgroup] 2024-01-31T22:09:27.2448381Z + [[ ! -d .ci/docker ]] 2024-01-31T22:09:27.2448937Z + [[ ! -f .ci/docker/build.sh ]] 2024-01-31T22:09:27.2449411Z + echo skip=false 2024-01-31T22:09:27.2451132Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2024-01-31T22:09:27.2453606Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2454739Z ++ awk -F '[:,]' '{print $2}' 2024-01-31T22:09:27.2459250Z + DOCKER_TAG=3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2460216Z + echo docker-tag=3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2461597Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2546279Z ##[group]Run set +e 2024-01-31T22:09:27.2546711Z set +e 2024-01-31T22:09:27.2547077Z set -x 2024-01-31T22:09:27.2547429Z  2024-01-31T22:09:27.2547981Z # Check if image already exists, if it does then skip building it 2024-01-31T22:09:27.2548764Z if docker manifest inspect "${DOCKER_IMAGE}"; then 2024-01-31T22:09:27.2549335Z  exit 0 2024-01-31T22:09:27.2549702Z fi 2024-01-31T22:09:27.2550050Z  2024-01-31T22:09:27.2550621Z # NB: This part requires a full checkout. Otherwise, the merge base will 2024-01-31T22:09:27.2551521Z # be empty. The default action would be to continue rebuild the image 2024-01-31T22:09:27.2552325Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2024-01-31T22:09:27.2553070Z  # if we're on the base branch then use the parent commit 2024-01-31T22:09:27.2553748Z  MERGE_BASE=$(git rev-parse HEAD~) 2024-01-31T22:09:27.2554258Z else 2024-01-31T22:09:27.2554792Z  # otherwise we're on a PR, so use the most recent base commit 2024-01-31T22:09:27.2555546Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2024-01-31T22:09:27.2556112Z fi 2024-01-31T22:09:27.2556448Z  2024-01-31T22:09:27.2556825Z if [[ -z "${MERGE_BASE}" ]]; then 2024-01-31T22:09:27.2557392Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2557913Z  2024-01-31T22:09:27.2558633Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 2024-01-31T22:09:27.2559458Z  exit 0 2024-01-31T22:09:27.2559823Z fi 2024-01-31T22:09:27.2560176Z  2024-01-31T22:09:27.2560687Z if ! git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2024-01-31T22:09:27.2561728Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2024-01-31T22:09:27.2562710Z  exit 1 2024-01-31T22:09:27.2563070Z fi 2024-01-31T22:09:27.2563410Z  2024-01-31T22:09:27.2563961Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2024-01-31T22:09:27.2565251Z # If no image exists but the hash is the same as the previous hash then we should error out here 2024-01-31T22:09:27.2566199Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2024-01-31T22:09:27.2567243Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2024-01-31T22:09:27.2568424Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2024-01-31T22:09:27.2569135Z fi 2024-01-31T22:09:27.2569476Z  2024-01-31T22:09:27.2569894Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2024-01-31T22:09:27.2578050Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.2578596Z env: 2024-01-31T22:09:27.2578934Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.2579358Z DOCKER_BUILD_DIR: .ci/docker 2024-01-31T22:09:27.2579894Z BASE_REVISION: 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:09:27.2581091Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2582375Z DOCKER_TAG: 3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.2582913Z ##[endgroup] 2024-01-31T22:09:27.2604476Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.5557343Z { 2024-01-31T22:09:27.5558511Z "schemaVersion": 2, 2024-01-31T22:09:27.5559164Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2024-01-31T22:09:27.5559840Z "config": { 2024-01-31T22:09:27.5560442Z "mediaType": "application/vnd.docker.container.image.v1+json", 2024-01-31T22:09:27.5561248Z "size": 45676, 2024-01-31T22:09:27.5562048Z "digest": "sha256:7d78b20012855630db29f7726cc638698c783297a18547e4ffbae45c1da59435" 2024-01-31T22:09:27.5562934Z }, 2024-01-31T22:09:27.5563371Z "layers": [ 2024-01-31T22:09:27.5563937Z { 2024-01-31T22:09:27.5564578Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5565765Z "size": 28580681, 2024-01-31T22:09:27.5566545Z "digest": "sha256:7a2c559011895d255fce249c00396abff5ae7e0c0a92931d0ed493e71de78e3a" 2024-01-31T22:09:27.5567390Z }, 2024-01-31T22:09:27.5567712Z { 2024-01-31T22:09:27.5568222Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5568852Z "size": 7943451, 2024-01-31T22:09:27.5569462Z "digest": "sha256:224fe954d7252f10539d243d6c9688806f7d13ad775ed02e7f7c79077844510d" 2024-01-31T22:09:27.5570122Z }, 2024-01-31T22:09:27.5570451Z { 2024-01-31T22:09:27.5570954Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5571695Z "size": 55728572, 2024-01-31T22:09:27.5572366Z "digest": "sha256:75722010b82e31715876aeeed0b2cee414296f0124fdfa061ab845ba2a158450" 2024-01-31T22:09:27.5573050Z }, 2024-01-31T22:09:27.5573608Z { 2024-01-31T22:09:27.5574129Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5574757Z "size": 186, 2024-01-31T22:09:27.5575381Z "digest": "sha256:d527cbbb87e3016fd72a18a9b468c945ad0ca27c5770b39debd6ed704db3a195" 2024-01-31T22:09:27.5576077Z }, 2024-01-31T22:09:27.5576407Z { 2024-01-31T22:09:27.5576914Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5577544Z "size": 6886, 2024-01-31T22:09:27.5578162Z "digest": "sha256:b57676e46aee1a8c82e528d78e5a13e31142524eea31c8b213d69ddcb6f1fe80" 2024-01-31T22:09:27.5578848Z }, 2024-01-31T22:09:27.5579176Z { 2024-01-31T22:09:27.5579836Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5580458Z "size": 1329001756, 2024-01-31T22:09:27.5581108Z "digest": "sha256:a8c1e85b5e14cec7af70bf304cb4d4cee6a1d25eb8215b2cf4fdc33e5af5e108" 2024-01-31T22:09:27.5581809Z }, 2024-01-31T22:09:27.5582127Z { 2024-01-31T22:09:27.5582639Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5583271Z "size": 62501, 2024-01-31T22:09:27.5583883Z "digest": "sha256:a41a8d1c11c8d80fe4e82b0d05478f8d51176ff20b8350905fc1b25c93a51198" 2024-01-31T22:09:27.5584567Z }, 2024-01-31T22:09:27.5584894Z { 2024-01-31T22:09:27.5585397Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5586029Z "size": 1684, 2024-01-31T22:09:27.5586631Z "digest": "sha256:0c12278907551c2962927d27c115f6f7bf0df894318b8aea6ece3ef01ccd0a8a" 2024-01-31T22:09:27.5587300Z }, 2024-01-31T22:09:27.5587627Z { 2024-01-31T22:09:27.5588134Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5588763Z "size": 1523, 2024-01-31T22:09:27.5589397Z "digest": "sha256:d8d1234baab3ec9ccb8bb710fc6b8ff6c10896ba2e8d27a347583eca770f9ff1" 2024-01-31T22:09:27.5590099Z }, 2024-01-31T22:09:27.5590420Z { 2024-01-31T22:09:27.5590930Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5591566Z "size": 2528295403, 2024-01-31T22:09:27.5592322Z "digest": "sha256:7ed32bc8e4696fcdb2feef850781160597b2275ad756819c4add88236b0577d5" 2024-01-31T22:09:27.5593020Z }, 2024-01-31T22:09:27.5593351Z { 2024-01-31T22:09:27.5593854Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5594481Z "size": 86016, 2024-01-31T22:09:27.5595111Z "digest": "sha256:ec1e7978c1fe161ced1d98092a51e7c5953ca5fda5577f54df9dbda4afff1b2b" 2024-01-31T22:09:27.5595800Z }, 2024-01-31T22:09:27.5596125Z { 2024-01-31T22:09:27.5596637Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5597264Z "size": 1830, 2024-01-31T22:09:27.5597877Z "digest": "sha256:e6c6c62213b7c195dfe2d0aa47821a3f7412049164f4e0c73ceaa562c87d64c0" 2024-01-31T22:09:27.5598564Z }, 2024-01-31T22:09:27.5598884Z { 2024-01-31T22:09:27.5599392Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5600026Z "size": 246718820, 2024-01-31T22:09:27.5600653Z "digest": "sha256:a72e1c3d7a64842cfd1313d80626249e72fc1081f5e80b36ca4eb8e1bd7e8e02" 2024-01-31T22:09:27.5601342Z }, 2024-01-31T22:09:27.5601669Z { 2024-01-31T22:09:27.5602202Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5602858Z "size": 545, 2024-01-31T22:09:27.5603477Z "digest": "sha256:e38e3f87c2f72a77d9fe0ac7f00ac2920caa33a23c526a879afd64de0157a054" 2024-01-31T22:09:27.5604163Z }, 2024-01-31T22:09:27.5604488Z { 2024-01-31T22:09:27.5605292Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5605933Z "size": 1284, 2024-01-31T22:09:27.5606541Z "digest": "sha256:4414a4cf055fa864f4140258e5c0f899764df642e014f296e44fc2ea239fbb9d" 2024-01-31T22:09:27.5607222Z }, 2024-01-31T22:09:27.5607541Z { 2024-01-31T22:09:27.5608051Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5608680Z "size": 485, 2024-01-31T22:09:27.5609285Z "digest": "sha256:3a9f3bab87ac8295217b45e11f84aa500fdf787306de8e081e015aa7c1a8b363" 2024-01-31T22:09:27.5609971Z }, 2024-01-31T22:09:27.5610300Z { 2024-01-31T22:09:27.5610806Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5611440Z "size": 91701785, 2024-01-31T22:09:27.5612089Z "digest": "sha256:8cb09e71074f9ad25ef0fbcdc95e40cbac8dac5bf4828dcd29beea7c4cf5bab7" 2024-01-31T22:09:27.5612793Z }, 2024-01-31T22:09:27.5613122Z { 2024-01-31T22:09:27.5613633Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5614252Z "size": 3169, 2024-01-31T22:09:27.5614970Z "digest": "sha256:dc1cd3eaed4bb1b634432f28fce79fa9c42a452f5241307127828ea9c46875c0" 2024-01-31T22:09:27.5615665Z }, 2024-01-31T22:09:27.5615985Z { 2024-01-31T22:09:27.5616495Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5617127Z "size": 1812, 2024-01-31T22:09:27.5617731Z "digest": "sha256:709d1f5e9caf19f6d66f6fd26e1094935d11b54c0b2a8532317c386f678db77d" 2024-01-31T22:09:27.5618417Z }, 2024-01-31T22:09:27.5618748Z { 2024-01-31T22:09:27.5619252Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5619882Z "size": 701, 2024-01-31T22:09:27.5620497Z "digest": "sha256:04bae84c1248a9df743ea3e6651b4e26c8385550229adc1a830766d4dc368f12" 2024-01-31T22:09:27.5621176Z }, 2024-01-31T22:09:27.5621504Z { 2024-01-31T22:09:27.5622016Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5622641Z "size": 2622646901, 2024-01-31T22:09:27.5623270Z "digest": "sha256:39c09b5c3871d3937018fc25d7c1355d9fc6b78021a6c22356b0e4b221dd75c1" 2024-01-31T22:09:27.5623955Z }, 2024-01-31T22:09:27.5624273Z { 2024-01-31T22:09:27.5624786Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5625416Z "size": 380, 2024-01-31T22:09:27.5626028Z "digest": "sha256:e3a19dc71e24e30c31fcf4e0b5df83b4b80af89865f3a9fa378e7863f286e7db" 2024-01-31T22:09:27.5626721Z }, 2024-01-31T22:09:27.5627060Z { 2024-01-31T22:09:27.5627634Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5628260Z "size": 6324, 2024-01-31T22:09:27.5628866Z "digest": "sha256:2c6f9f6f78ae4807d314278b924b8ad57cb65f8215d009335fa912282a59b687" 2024-01-31T22:09:27.5629534Z }, 2024-01-31T22:09:27.5629852Z { 2024-01-31T22:09:27.5630352Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5630964Z "size": 803, 2024-01-31T22:09:27.5631573Z "digest": "sha256:28bc578be05685931a1e7a6b7ba58fc8e9e402e13fa17f915bc5c065b7d25eb0" 2024-01-31T22:09:27.5632261Z }, 2024-01-31T22:09:27.5632621Z { 2024-01-31T22:09:27.5633125Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5633753Z "size": 106, 2024-01-31T22:09:27.5634349Z "digest": "sha256:10c7264f16e2053f9adb15a3dbd6148e857a92a334432b35ed1bb6cc76761820" 2024-01-31T22:09:27.5635026Z }, 2024-01-31T22:09:27.5635344Z { 2024-01-31T22:09:27.5635845Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5636478Z "size": 495, 2024-01-31T22:09:27.5637084Z "digest": "sha256:2eb59b280f8b62fc0b4a39454fe3f4349d1eb709d6b87b3a290b30c238ba1223" 2024-01-31T22:09:27.5637766Z }, 2024-01-31T22:09:27.5638078Z { 2024-01-31T22:09:27.5638578Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5639191Z "size": 121477177, 2024-01-31T22:09:27.5639813Z "digest": "sha256:24162b9a69978e9141bb616b0c1fa44f7f28c553bc1be472a0e2f904215b81be" 2024-01-31T22:09:27.5640490Z }, 2024-01-31T22:09:27.5640803Z { 2024-01-31T22:09:27.5641302Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5641920Z "size": 109, 2024-01-31T22:09:27.5642504Z "digest": "sha256:7a669a2e944544c369401dbe9fe7d62728528fd2e434a107f6363275a753a6e7" 2024-01-31T22:09:27.5643173Z }, 2024-01-31T22:09:27.5643491Z { 2024-01-31T22:09:27.5643987Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5644608Z "size": 558, 2024-01-31T22:09:27.5645461Z "digest": "sha256:18a6f0a160e7b4fafddf5a0916575577ed1c2223a8e31c76116a1fc5fa0423fd" 2024-01-31T22:09:27.5646147Z }, 2024-01-31T22:09:27.5646461Z { 2024-01-31T22:09:27.5646886Z + exit 0 2024-01-31T22:09:27.5647407Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5648031Z "size": 1175617, 2024-01-31T22:09:27.5648647Z "digest": "sha256:2a5d3623a8bd3be8558a5a6eb85906e0cf312132e98a336efd6afe56a2fd7b84" 2024-01-31T22:09:27.5649331Z }, 2024-01-31T22:09:27.5649747Z { 2024-01-31T22:09:27.5650247Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5650868Z "size": 103, 2024-01-31T22:09:27.5651498Z "digest": "sha256:60cc705c0a7f54bd73aeed1b2dbf9ea96efcdfc5d785413eef4b8477d10df51a" 2024-01-31T22:09:27.5652191Z }, 2024-01-31T22:09:27.5652512Z { 2024-01-31T22:09:27.5653018Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5653639Z "size": 1489, 2024-01-31T22:09:27.5654260Z "digest": "sha256:502ded0a4cc272bc757433ec51da9fd9ed459eb601c0cf1b082713b09ac3db98" 2024-01-31T22:09:27.5654951Z }, 2024-01-31T22:09:27.5655269Z { 2024-01-31T22:09:27.5655771Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5656398Z "size": 439855396, 2024-01-31T22:09:27.5657027Z "digest": "sha256:a9584c8d8daa9ddc774c347bb4fa42791eb9f422c43a44b9070e43c40cd24bb4" 2024-01-31T22:09:27.5657713Z }, 2024-01-31T22:09:27.5658035Z { 2024-01-31T22:09:27.5658543Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5659169Z "size": 160, 2024-01-31T22:09:27.5659780Z "digest": "sha256:d8b0c1606d41dedb8866362b866f25330e5438debb0926c20ee6bdf5c82ff90f" 2024-01-31T22:09:27.5660455Z }, 2024-01-31T22:09:27.5660779Z { 2024-01-31T22:09:27.5661286Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5661908Z "size": 564, 2024-01-31T22:09:27.5662597Z "digest": "sha256:a7cef78e941556a101c9e62d1fc37b391dfa579bdace81ff396b400ecfd4cd44" 2024-01-31T22:09:27.5663295Z }, 2024-01-31T22:09:27.5663611Z { 2024-01-31T22:09:27.5664120Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5664749Z "size": 33264369, 2024-01-31T22:09:27.5665344Z "digest": "sha256:b53875681b5b47726868867828148338531c898b367aeb47c906eea8fa28c0e0" 2024-01-31T22:09:27.5666002Z }, 2024-01-31T22:09:27.5666326Z { 2024-01-31T22:09:27.5666825Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5667455Z "size": 104, 2024-01-31T22:09:27.5668054Z "digest": "sha256:563889f5b8337a7828df3de7b9ad427762409e86d5a9286483b5ac121d268085" 2024-01-31T22:09:27.5668721Z }, 2024-01-31T22:09:27.5669047Z { 2024-01-31T22:09:27.5669553Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5670169Z "size": 425, 2024-01-31T22:09:27.5670785Z "digest": "sha256:fec002e849bffe09856f516e4f5256a3b72c9cb72b0f9330f9c69ab2a8cb2bfb" 2024-01-31T22:09:27.5671466Z }, 2024-01-31T22:09:27.5671782Z { 2024-01-31T22:09:27.5672287Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5672916Z "size": 20262048, 2024-01-31T22:09:27.5673541Z "digest": "sha256:0997871b0aa0c9af44a47af5057afdf63dbaa913b3cd8b2af90bb4e80a874200" 2024-01-31T22:09:27.5674231Z }, 2024-01-31T22:09:27.5674556Z { 2024-01-31T22:09:27.5675052Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5675680Z "size": 439, 2024-01-31T22:09:27.5676289Z "digest": "sha256:caa76df75bae724a38f3c430a42b325c71784201081bc96dd17e1786b5d80c48" 2024-01-31T22:09:27.5676964Z }, 2024-01-31T22:09:27.5677289Z { 2024-01-31T22:09:27.5677793Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5678410Z "size": 701, 2024-01-31T22:09:27.5679012Z "digest": "sha256:04bae84c1248a9df743ea3e6651b4e26c8385550229adc1a830766d4dc368f12" 2024-01-31T22:09:27.5679701Z }, 2024-01-31T22:09:27.5680016Z { 2024-01-31T22:09:27.5680520Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5681142Z "size": 140, 2024-01-31T22:09:27.5681743Z "digest": "sha256:3c343ac8ae12fab9064e9c7cb2644d57250263ab076a2c14a39250ee8c555bf3" 2024-01-31T22:09:27.5682427Z }, 2024-01-31T22:09:27.5682748Z { 2024-01-31T22:09:27.5683248Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5683871Z "size": 136, 2024-01-31T22:09:27.5684555Z "digest": "sha256:b4fd78a94e6e50409ef4fb0b86bdde48ea80b218d37a94b8bdd601c5720c0b1f" 2024-01-31T22:09:27.5685462Z }, 2024-01-31T22:09:27.5685780Z { 2024-01-31T22:09:27.5686301Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5686950Z "size": 32, 2024-01-31T22:09:27.5687593Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-01-31T22:09:27.5688334Z }, 2024-01-31T22:09:27.5688641Z { 2024-01-31T22:09:27.5689168Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5689822Z "size": 191, 2024-01-31T22:09:27.5690454Z "digest": "sha256:04a48576b446afdecf40d2cc8a77f16b14687686dff778e23308816e418d8aaa" 2024-01-31T22:09:27.5691185Z }, 2024-01-31T22:09:27.5691503Z { 2024-01-31T22:09:27.5692015Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5692725Z "size": 565, 2024-01-31T22:09:27.5693356Z "digest": "sha256:7b9381d5e5fb25645f62c528c0a3654fb90e8c590963d81cb83d175de4ae7d35" 2024-01-31T22:09:27.5694083Z }, 2024-01-31T22:09:27.5694393Z { 2024-01-31T22:09:27.5694910Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5695576Z "size": 43165784, 2024-01-31T22:09:27.5696221Z "digest": "sha256:e3c6890505309f9d55c387cfe643eb07136d1fe960c7025ab2d52aa8b3ff5fca" 2024-01-31T22:09:27.5696949Z }, 2024-01-31T22:09:27.5697265Z { 2024-01-31T22:09:27.5697877Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5698499Z "size": 106, 2024-01-31T22:09:27.5699096Z "digest": "sha256:9fba34398bbd7f5544e7bb2917296492c5f7af591b6a84d0a4c1d133198e4098" 2024-01-31T22:09:27.5699756Z }, 2024-01-31T22:09:27.5700072Z { 2024-01-31T22:09:27.5700566Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5701170Z "size": 1155, 2024-01-31T22:09:27.5701776Z "digest": "sha256:b4a66cb83da35d1e81e2a3d3660b454cde17c11f8bfa1ad4e9080150c5f5b5c9" 2024-01-31T22:09:27.5702463Z }, 2024-01-31T22:09:27.5702769Z { 2024-01-31T22:09:27.5703264Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5703904Z "size": 701, 2024-01-31T22:09:27.5704493Z "digest": "sha256:04bae84c1248a9df743ea3e6651b4e26c8385550229adc1a830766d4dc368f12" 2024-01-31T22:09:27.5705166Z }, 2024-01-31T22:09:27.5705484Z { 2024-01-31T22:09:27.5705978Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5706598Z "size": 138, 2024-01-31T22:09:27.5707200Z "digest": "sha256:50c25986af4101ef78f8203acdc218e25e7a6a90ff40fa5dc72dd644f369fde6" 2024-01-31T22:09:27.5707870Z }, 2024-01-31T22:09:27.5708188Z { 2024-01-31T22:09:27.5708684Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5709288Z "size": 120, 2024-01-31T22:09:27.5709884Z "digest": "sha256:d33ac22afed43999e8e0065e6c9d69f133893d81d44f90a62da057ad53fabe7e" 2024-01-31T22:09:27.5710559Z }, 2024-01-31T22:09:27.5710870Z { 2024-01-31T22:09:27.5711368Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5711988Z "size": 1625768750, 2024-01-31T22:09:27.5712606Z "digest": "sha256:94b1f01c8899c1b5b6a94676773ac1a013fe9c2d4cffc0f1f3cd8c8a9b0408b7" 2024-01-31T22:09:27.5713283Z }, 2024-01-31T22:09:27.5713599Z { 2024-01-31T22:09:27.5714088Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5714708Z "size": 171, 2024-01-31T22:09:27.5715301Z "digest": "sha256:65f8aa6b7f172c5540627c5363f4ea87e9543e13f60f7aa09c0cbabc31e753cf" 2024-01-31T22:09:27.5715961Z }, 2024-01-31T22:09:27.5716273Z { 2024-01-31T22:09:27.5716769Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5717374Z "size": 1840, 2024-01-31T22:09:27.5717969Z "digest": "sha256:d5e6634c432790e6a3e99dd7b5f501732172fac798c03e03a7b61459806547c4" 2024-01-31T22:09:27.5718626Z }, 2024-01-31T22:09:27.5719039Z { 2024-01-31T22:09:27.5719540Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5720156Z "size": 7529786, 2024-01-31T22:09:27.5720769Z "digest": "sha256:46ec2cdf7a22de47c947f66c7fc73eb87fc9a21cce5b3b205a7336c217e8c331" 2024-01-31T22:09:27.5721449Z }, 2024-01-31T22:09:27.5721760Z { 2024-01-31T22:09:27.5722250Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5722868Z "size": 165, 2024-01-31T22:09:27.5723460Z "digest": "sha256:6451065259413dd645ae2dd581c8c4c8d4286f8ed508b38c3a57f9288f6a2a4f" 2024-01-31T22:09:27.5724113Z }, 2024-01-31T22:09:27.5724426Z { 2024-01-31T22:09:27.5725131Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5725747Z "size": 7940, 2024-01-31T22:09:27.5726347Z "digest": "sha256:eead4008daa031ac2163b42bab5418071d9883460209fe9f5728b4395b0e8cd4" 2024-01-31T22:09:27.5727018Z }, 2024-01-31T22:09:27.5727329Z { 2024-01-31T22:09:27.5727834Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5728452Z "size": 8071, 2024-01-31T22:09:27.5729049Z "digest": "sha256:086713bad12c5b6749ca1138ae44d3383beb12d9d4d31b80f73b3ce7da74ff98" 2024-01-31T22:09:27.5729722Z }, 2024-01-31T22:09:27.5730038Z { 2024-01-31T22:09:27.5730528Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5731143Z "size": 305, 2024-01-31T22:09:27.5731834Z "digest": "sha256:a6f921a713792a969897537ab46dd9ff1e6126e7afa472aefda518a39da76ac0" 2024-01-31T22:09:27.5732501Z }, 2024-01-31T22:09:27.5732821Z { 2024-01-31T22:09:27.5733320Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5733931Z "size": 7618675, 2024-01-31T22:09:27.5734534Z "digest": "sha256:07da4866f0f72ddd67232d177503543c406b08df02e134ab0f0dd29744ae9ad9" 2024-01-31T22:09:27.5735200Z }, 2024-01-31T22:09:27.5735509Z { 2024-01-31T22:09:27.5736005Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5736624Z "size": 108, 2024-01-31T22:09:27.5737200Z "digest": "sha256:551d129979c42fe00781626381bc4ad04ca95823510e231f4f9b2f3838886b27" 2024-01-31T22:09:27.5737860Z }, 2024-01-31T22:09:27.5738176Z { 2024-01-31T22:09:27.5738668Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5739288Z "size": 54145480, 2024-01-31T22:09:27.5739921Z "digest": "sha256:981cedff2c35d9aa17e5adc1f74b6e0cb026fe7c2592cf1dfc9dee9ec55eb069" 2024-01-31T22:09:27.5740612Z }, 2024-01-31T22:09:27.5740930Z { 2024-01-31T22:09:27.5741429Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5742038Z "size": 498, 2024-01-31T22:09:27.5742637Z "digest": "sha256:af51204608116d8fd503bb305b6a6f10ee89e46da63359d249cd89de5c178680" 2024-01-31T22:09:27.5743314Z }, 2024-01-31T22:09:27.5743623Z { 2024-01-31T22:09:27.5744125Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5744751Z "size": 1496846313, 2024-01-31T22:09:27.5745353Z "digest": "sha256:4de4b2584f25390ef45261824855443e768559a979cfdfce412006e4a6867481" 2024-01-31T22:09:27.5746024Z }, 2024-01-31T22:09:27.5746343Z { 2024-01-31T22:09:27.5746836Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5747462Z "size": 106, 2024-01-31T22:09:27.5748068Z "digest": "sha256:2549feec2e55efcc90704c921916cd4b846d9eedab5261cf009d97192ebb9670" 2024-01-31T22:09:27.5748739Z }, 2024-01-31T22:09:27.5749058Z { 2024-01-31T22:09:27.5749560Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5750178Z "size": 445, 2024-01-31T22:09:27.5750807Z "digest": "sha256:fb0771ee647b27277d3256910e8dfee1abb7678988edc44bf6d30fd1dc38dad0" 2024-01-31T22:09:27.5751486Z }, 2024-01-31T22:09:27.5751804Z { 2024-01-31T22:09:27.5752303Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5752976Z "size": 46248510, 2024-01-31T22:09:27.5753737Z "digest": "sha256:6f3a3cca3ba48d69363ae70d8f7a90d3c580744cb5beb16b1f0b0d3670baa656" 2024-01-31T22:09:27.5754418Z }, 2024-01-31T22:09:27.5754737Z { 2024-01-31T22:09:27.5755243Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5755854Z "size": 111, 2024-01-31T22:09:27.5756448Z "digest": "sha256:357fa0717d4f57129acaf79f01642eb4b03f5a4d4d52156008454b3007eeae07" 2024-01-31T22:09:27.5757121Z }, 2024-01-31T22:09:27.5757436Z { 2024-01-31T22:09:27.5757940Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5758559Z "size": 32, 2024-01-31T22:09:27.5759153Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-01-31T22:09:27.5759835Z }, 2024-01-31T22:09:27.5760154Z { 2024-01-31T22:09:27.5760643Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5761259Z "size": 32, 2024-01-31T22:09:27.5761869Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-01-31T22:09:27.5762597Z }, 2024-01-31T22:09:27.5762907Z { 2024-01-31T22:09:27.5763406Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-01-31T22:09:27.5764024Z "size": 32, 2024-01-31T22:09:27.5764620Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-01-31T22:09:27.5765553Z } 2024-01-31T22:09:27.5765883Z ] 2024-01-31T22:09:27.5766196Z } 2024-01-31T22:09:27.5867332Z ##[group]Run tag=${ECR_DOCKER_IMAGE##*/} 2024-01-31T22:09:27.5867888Z tag=${ECR_DOCKER_IMAGE##*/} 2024-01-31T22:09:27.5868516Z echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}" 2024-01-31T22:09:27.5876602Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.5877168Z env: 2024-01-31T22:09:27.5877524Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.5878611Z ECR_DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.5879735Z ##[endgroup] 2024-01-31T22:09:27.5902235Z docker pull ghcr.io/pytorch/ci-image:pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.5945333Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2024-01-31T22:09:27.5945983Z with: 2024-01-31T22:09:27.5946989Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.5948087Z env: 2024-01-31T22:09:27.5948442Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.5948861Z ##[endgroup] 2024-01-31T22:09:27.6012008Z ##[group]Run retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } 2024-01-31T22:09:27.6012767Z retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } 2024-01-31T22:09:27.6013524Z # ignore output since only exit code is used for conditional 2024-01-31T22:09:27.6014318Z # only pull docker image if it's not available locally 2024-01-31T22:09:27.6015153Z if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2024-01-31T22:09:27.6015950Z  retry docker pull "${DOCKER_IMAGE}" 2024-01-31T22:09:27.6016459Z fi 2024-01-31T22:09:27.6023713Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:09:27.6024289Z env: 2024-01-31T22:09:27.6024651Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:09:27.6025720Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:09:27.6026827Z ##[endgroup] 2024-01-31T22:09:27.9241318Z 3266cf2f0fa6c50bf5453a6ccbc1904de753c35f: Pulling from pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9 2024-01-31T22:09:27.9244418Z 7a2c55901189: Pulling fs layer 2024-01-31T22:09:27.9245294Z 224fe954d725: Pulling fs layer 2024-01-31T22:09:27.9246040Z 75722010b82e: Pulling fs layer 2024-01-31T22:09:27.9246468Z d527cbbb87e3: Pulling fs layer 2024-01-31T22:09:27.9246847Z b57676e46aee: Pulling fs layer 2024-01-31T22:09:27.9247270Z a8c1e85b5e14: Pulling fs layer 2024-01-31T22:09:27.9247773Z b57676e46aee: Waiting 2024-01-31T22:09:27.9248253Z a41a8d1c11c8: Pulling fs layer 2024-01-31T22:09:27.9248759Z 0c1227890755: Pulling fs layer 2024-01-31T22:09:27.9249265Z d8d1234baab3: Pulling fs layer 2024-01-31T22:09:27.9249769Z d527cbbb87e3: Waiting 2024-01-31T22:09:27.9250257Z 7ed32bc8e469: Pulling fs layer 2024-01-31T22:09:27.9250784Z ec1e7978c1fe: Pulling fs layer 2024-01-31T22:09:27.9251288Z e6c6c62213b7: Pulling fs layer 2024-01-31T22:09:27.9251793Z a72e1c3d7a64: Pulling fs layer 2024-01-31T22:09:27.9252286Z a8c1e85b5e14: Waiting 2024-01-31T22:09:27.9252742Z e38e3f87c2f7: Pulling fs layer 2024-01-31T22:09:27.9253246Z 7ed32bc8e469: Waiting 2024-01-31T22:09:27.9253681Z 4414a4cf055f: Pulling fs layer 2024-01-31T22:09:27.9254189Z 3a9f3bab87ac: Pulling fs layer 2024-01-31T22:09:27.9254683Z ec1e7978c1fe: Waiting 2024-01-31T22:09:27.9255110Z d8d1234baab3: Waiting 2024-01-31T22:09:27.9255567Z 8cb09e71074f: Pulling fs layer 2024-01-31T22:09:27.9256104Z dc1cd3eaed4b: Pulling fs layer 2024-01-31T22:09:27.9256638Z 709d1f5e9caf: Pulling fs layer 2024-01-31T22:09:27.9257164Z 04bae84c1248: Pulling fs layer 2024-01-31T22:09:27.9257671Z a72e1c3d7a64: Waiting 2024-01-31T22:09:27.9258102Z e6c6c62213b7: Waiting 2024-01-31T22:09:27.9258556Z 39c09b5c3871: Pulling fs layer 2024-01-31T22:09:27.9259047Z 4414a4cf055f: Waiting 2024-01-31T22:09:27.9259390Z 3a9f3bab87ac: Waiting 2024-01-31T22:09:27.9259821Z 8cb09e71074f: Waiting 2024-01-31T22:09:27.9260254Z 709d1f5e9caf: Waiting 2024-01-31T22:09:27.9260686Z dc1cd3eaed4b: Waiting 2024-01-31T22:09:27.9271108Z 04bae84c1248: Waiting 2024-01-31T22:09:27.9271615Z 39c09b5c3871: Waiting 2024-01-31T22:09:27.9272087Z e38e3f87c2f7: Waiting 2024-01-31T22:09:27.9272540Z e3a19dc71e24: Pulling fs layer 2024-01-31T22:09:27.9273026Z 2c6f9f6f78ae: Pulling fs layer 2024-01-31T22:09:27.9273517Z e3a19dc71e24: Waiting 2024-01-31T22:09:27.9273918Z 28bc578be056: Pulling fs layer 2024-01-31T22:09:27.9291045Z 2c6f9f6f78ae: Waiting 2024-01-31T22:09:27.9291394Z 28bc578be056: Waiting 2024-01-31T22:09:27.9292025Z 10c7264f16e2: Pulling fs layer 2024-01-31T22:09:27.9292458Z 2eb59b280f8b: Pulling fs layer 2024-01-31T22:09:27.9292821Z 10c7264f16e2: Waiting 2024-01-31T22:09:27.9293144Z 24162b9a6997: Pulling fs layer 2024-01-31T22:09:27.9293497Z 2eb59b280f8b: Waiting 2024-01-31T22:09:27.9293823Z 7a669a2e9445: Pulling fs layer 2024-01-31T22:09:27.9294188Z 18a6f0a160e7: Pulling fs layer 2024-01-31T22:09:27.9294559Z 2a5d3623a8bd: Pulling fs layer 2024-01-31T22:09:27.9294916Z 18a6f0a160e7: Waiting 2024-01-31T22:09:27.9295235Z 60cc705c0a7f: Pulling fs layer 2024-01-31T22:09:27.9295612Z 502ded0a4cc2: Pulling fs layer 2024-01-31T22:09:27.9295966Z 2a5d3623a8bd: Waiting 2024-01-31T22:09:27.9296284Z a9584c8d8daa: Pulling fs layer 2024-01-31T22:09:27.9296636Z 60cc705c0a7f: Waiting 2024-01-31T22:09:27.9296962Z d8b0c1606d41: Pulling fs layer 2024-01-31T22:09:27.9297340Z a7cef78e9415: Pulling fs layer 2024-01-31T22:09:27.9297705Z b53875681b5b: Pulling fs layer 2024-01-31T22:09:27.9298063Z 563889f5b833: Pulling fs layer 2024-01-31T22:09:27.9298436Z fec002e849bf: Pulling fs layer 2024-01-31T22:09:27.9298792Z d8b0c1606d41: Waiting 2024-01-31T22:09:27.9299092Z a7cef78e9415: Waiting 2024-01-31T22:09:27.9299399Z b53875681b5b: Waiting 2024-01-31T22:09:27.9299715Z 0997871b0aa0: Pulling fs layer 2024-01-31T22:09:27.9300080Z caa76df75bae: Pulling fs layer 2024-01-31T22:09:27.9300436Z 563889f5b833: Waiting 2024-01-31T22:09:27.9300756Z 3c343ac8ae12: Pulling fs layer 2024-01-31T22:09:27.9301224Z fec002e849bf: Waiting 2024-01-31T22:09:27.9301544Z b4fd78a94e6e: Pulling fs layer 2024-01-31T22:09:27.9301902Z caa76df75bae: Waiting 2024-01-31T22:09:27.9302267Z 0997871b0aa0: Waiting 2024-01-31T22:09:27.9302582Z 4f4fb700ef54: Pulling fs layer 2024-01-31T22:09:27.9302930Z 3c343ac8ae12: Waiting 2024-01-31T22:09:27.9303236Z b4fd78a94e6e: Waiting 2024-01-31T22:09:27.9303688Z 04a48576b446: Pulling fs layer 2024-01-31T22:09:27.9304054Z 7b9381d5e5fb: Pulling fs layer 2024-01-31T22:09:27.9304408Z 4f4fb700ef54: Waiting 2024-01-31T22:09:27.9304722Z e3c689050530: Pulling fs layer 2024-01-31T22:09:27.9305078Z 7b9381d5e5fb: Waiting 2024-01-31T22:09:27.9305406Z 9fba34398bbd: Pulling fs layer 2024-01-31T22:09:27.9305766Z b4a66cb83da3: Pulling fs layer 2024-01-31T22:09:27.9306116Z e3c689050530: Waiting 2024-01-31T22:09:27.9306429Z 50c25986af41: Pulling fs layer 2024-01-31T22:09:27.9306789Z d33ac22afed4: Pulling fs layer 2024-01-31T22:09:27.9307166Z 94b1f01c8899: Pulling fs layer 2024-01-31T22:09:27.9307515Z b4a66cb83da3: Waiting 2024-01-31T22:09:27.9307833Z 65f8aa6b7f17: Pulling fs layer 2024-01-31T22:09:27.9308199Z d5e6634c4327: Pulling fs layer 2024-01-31T22:09:27.9308652Z 65f8aa6b7f17: Waiting 2024-01-31T22:09:27.9309106Z d33ac22afed4: Waiting 2024-01-31T22:09:27.9309588Z 46ec2cdf7a22: Pulling fs layer 2024-01-31T22:09:27.9310103Z d5e6634c4327: Waiting 2024-01-31T22:09:27.9310569Z 645106525941: Pulling fs layer 2024-01-31T22:09:27.9311102Z eead4008daa0: Pulling fs layer 2024-01-31T22:09:27.9311619Z 086713bad12c: Pulling fs layer 2024-01-31T22:09:27.9312108Z 645106525941: Waiting 2024-01-31T22:09:27.9312550Z a6f921a71379: Pulling fs layer 2024-01-31T22:09:27.9313051Z 07da4866f0f7: Pulling fs layer 2024-01-31T22:09:27.9313556Z eead4008daa0: Waiting 2024-01-31T22:09:27.9314026Z 551d129979c4: Pulling fs layer 2024-01-31T22:09:27.9314556Z 981cedff2c35: Pulling fs layer 2024-01-31T22:09:27.9315076Z 086713bad12c: Waiting 2024-01-31T22:09:27.9315516Z 551d129979c4: Waiting 2024-01-31T22:09:27.9315957Z af5120460811: Pulling fs layer 2024-01-31T22:09:27.9316474Z 4de4b2584f25: Pulling fs layer 2024-01-31T22:09:27.9316999Z 2549feec2e55: Pulling fs layer 2024-01-31T22:09:27.9317488Z af5120460811: Waiting 2024-01-31T22:09:27.9317951Z fb0771ee647b: Pulling fs layer 2024-01-31T22:09:27.9318467Z 07da4866f0f7: Waiting 2024-01-31T22:09:27.9318931Z 6f3a3cca3ba4: Pulling fs layer 2024-01-31T22:09:27.9319457Z 4de4b2584f25: Waiting 2024-01-31T22:09:27.9319918Z 357fa0717d4f: Pulling fs layer 2024-01-31T22:09:27.9320410Z 6f3a3cca3ba4: Waiting 2024-01-31T22:09:27.9320839Z 2549feec2e55: Waiting 2024-01-31T22:09:27.9321377Z 357fa0717d4f: Waiting 2024-01-31T22:09:28.1723516Z 224fe954d725: Verifying Checksum 2024-01-31T22:09:28.1723940Z 224fe954d725: Download complete 2024-01-31T22:09:28.2449521Z d527cbbb87e3: Verifying Checksum 2024-01-31T22:09:28.2449937Z d527cbbb87e3: Download complete 2024-01-31T22:09:28.2775896Z 7a2c55901189: Verifying Checksum 2024-01-31T22:09:28.3247774Z b57676e46aee: Download complete 2024-01-31T22:09:28.4139371Z a41a8d1c11c8: Download complete 2024-01-31T22:09:28.5180682Z 0c1227890755: Verifying Checksum 2024-01-31T22:09:28.5181300Z 0c1227890755: Download complete 2024-01-31T22:09:28.5924196Z d8d1234baab3: Verifying Checksum 2024-01-31T22:09:28.5924785Z d8d1234baab3: Download complete 2024-01-31T22:09:28.6587169Z 75722010b82e: Verifying Checksum 2024-01-31T22:09:28.6587707Z 75722010b82e: Download complete 2024-01-31T22:09:28.7347753Z ec1e7978c1fe: Verifying Checksum 2024-01-31T22:09:28.7348285Z ec1e7978c1fe: Download complete 2024-01-31T22:09:28.9094075Z e6c6c62213b7: Verifying Checksum 2024-01-31T22:09:28.9094731Z e6c6c62213b7: Download complete 2024-01-31T22:09:28.9216016Z 7a2c55901189: Pull complete 2024-01-31T22:09:29.1637504Z 224fe954d725: Pull complete 2024-01-31T22:09:30.0309153Z 75722010b82e: Pull complete 2024-01-31T22:09:30.1143382Z d527cbbb87e3: Pull complete 2024-01-31T22:09:30.2041741Z b57676e46aee: Pull complete 2024-01-31T22:09:35.1407543Z a72e1c3d7a64: Verifying Checksum 2024-01-31T22:09:35.1408015Z a72e1c3d7a64: Download complete 2024-01-31T22:09:35.2284479Z e38e3f87c2f7: Verifying Checksum 2024-01-31T22:09:35.2285225Z e38e3f87c2f7: Download complete 2024-01-31T22:09:35.3460602Z 4414a4cf055f: Verifying Checksum 2024-01-31T22:09:35.4550837Z 3a9f3bab87ac: Verifying Checksum 2024-01-31T22:09:35.4551262Z 3a9f3bab87ac: Download complete 2024-01-31T22:09:37.7614561Z 8cb09e71074f: Verifying Checksum 2024-01-31T22:09:37.7615000Z 8cb09e71074f: Download complete 2024-01-31T22:09:37.8608387Z dc1cd3eaed4b: Verifying Checksum 2024-01-31T22:09:37.8608794Z dc1cd3eaed4b: Download complete 2024-01-31T22:09:37.9602173Z 709d1f5e9caf: Verifying Checksum 2024-01-31T22:09:37.9602584Z 709d1f5e9caf: Download complete 2024-01-31T22:09:38.0593061Z 04bae84c1248: Verifying Checksum 2024-01-31T22:09:38.0593478Z 04bae84c1248: Download complete 2024-01-31T22:09:41.6321745Z a8c1e85b5e14: Verifying Checksum 2024-01-31T22:09:41.6322905Z a8c1e85b5e14: Download complete 2024-01-31T22:09:41.7512415Z e3a19dc71e24: Download complete 2024-01-31T22:09:41.8524636Z 2c6f9f6f78ae: Download complete 2024-01-31T22:09:41.9525286Z 28bc578be056: Verifying Checksum 2024-01-31T22:09:41.9525841Z 28bc578be056: Download complete 2024-01-31T22:09:42.0898273Z 10c7264f16e2: Verifying Checksum 2024-01-31T22:09:42.0898802Z 10c7264f16e2: Download complete 2024-01-31T22:09:42.1768404Z 2eb59b280f8b: Verifying Checksum 2024-01-31T22:09:42.1768943Z 2eb59b280f8b: Download complete 2024-01-31T22:09:45.2329473Z 24162b9a6997: Verifying Checksum 2024-01-31T22:09:45.2329945Z 24162b9a6997: Download complete 2024-01-31T22:09:45.3228944Z 7a669a2e9445: Verifying Checksum 2024-01-31T22:09:45.3229379Z 7a669a2e9445: Download complete 2024-01-31T22:09:45.4138544Z 18a6f0a160e7: Verifying Checksum 2024-01-31T22:09:45.4139001Z 18a6f0a160e7: Download complete 2024-01-31T22:09:45.5773641Z 2a5d3623a8bd: Verifying Checksum 2024-01-31T22:09:45.5774176Z 2a5d3623a8bd: Download complete 2024-01-31T22:09:45.6508867Z 60cc705c0a7f: Verifying Checksum 2024-01-31T22:09:45.6509421Z 60cc705c0a7f: Download complete 2024-01-31T22:09:45.7316445Z 502ded0a4cc2: Verifying Checksum 2024-01-31T22:09:45.7316981Z 502ded0a4cc2: Download complete 2024-01-31T22:09:53.8486924Z a8c1e85b5e14: Pull complete 2024-01-31T22:09:53.9279099Z 7ed32bc8e469: Verifying Checksum 2024-01-31T22:09:53.9279756Z 7ed32bc8e469: Download complete 2024-01-31T22:09:54.0319953Z d8b0c1606d41: Download complete 2024-01-31T22:09:54.1041511Z a41a8d1c11c8: Pull complete 2024-01-31T22:09:54.1344411Z a7cef78e9415: Verifying Checksum 2024-01-31T22:09:54.1345105Z a7cef78e9415: Download complete 2024-01-31T22:09:54.2047515Z 0c1227890755: Pull complete 2024-01-31T22:09:54.3150807Z d8d1234baab3: Pull complete 2024-01-31T22:09:55.3495163Z b53875681b5b: Verifying Checksum 2024-01-31T22:09:55.3495641Z b53875681b5b: Download complete 2024-01-31T22:09:55.4233639Z 563889f5b833: Verifying Checksum 2024-01-31T22:09:55.4234268Z 563889f5b833: Download complete 2024-01-31T22:09:55.5151217Z fec002e849bf: Verifying Checksum 2024-01-31T22:09:55.5151813Z fec002e849bf: Download complete 2024-01-31T22:09:56.2600761Z 0997871b0aa0: Verifying Checksum 2024-01-31T22:09:56.2601231Z 0997871b0aa0: Download complete 2024-01-31T22:09:56.3608874Z caa76df75bae: Verifying Checksum 2024-01-31T22:09:56.3609505Z caa76df75bae: Download complete 2024-01-31T22:09:56.3772245Z a9584c8d8daa: Verifying Checksum 2024-01-31T22:09:56.3772764Z a9584c8d8daa: Download complete 2024-01-31T22:09:56.4537461Z b4fd78a94e6e: Verifying Checksum 2024-01-31T22:09:56.4538000Z b4fd78a94e6e: Download complete 2024-01-31T22:09:56.4623145Z 3c343ac8ae12: Verifying Checksum 2024-01-31T22:09:56.4623645Z 3c343ac8ae12: Download complete 2024-01-31T22:09:56.4634442Z 4f4fb700ef54: Verifying Checksum 2024-01-31T22:09:56.4634855Z 4f4fb700ef54: Download complete 2024-01-31T22:09:56.5398501Z 7b9381d5e5fb: Download complete 2024-01-31T22:09:56.5612919Z 04a48576b446: Download complete 2024-01-31T22:09:56.6555230Z 9fba34398bbd: Download complete 2024-01-31T22:09:56.7250094Z b4a66cb83da3: Verifying Checksum 2024-01-31T22:09:56.7250732Z b4a66cb83da3: Download complete 2024-01-31T22:09:56.8242421Z 50c25986af41: Verifying Checksum 2024-01-31T22:09:56.8242878Z 50c25986af41: Download complete 2024-01-31T22:09:56.9110773Z d33ac22afed4: Download complete 2024-01-31T22:09:57.6394204Z e3c689050530: Verifying Checksum 2024-01-31T22:09:57.6394835Z e3c689050530: Download complete 2024-01-31T22:09:57.7206402Z 65f8aa6b7f17: Verifying Checksum 2024-01-31T22:09:57.7206966Z 65f8aa6b7f17: Download complete 2024-01-31T22:09:57.8067618Z d5e6634c4327: Verifying Checksum 2024-01-31T22:09:57.8068142Z d5e6634c4327: Download complete 2024-01-31T22:09:58.1321537Z 46ec2cdf7a22: Verifying Checksum 2024-01-31T22:09:58.1322044Z 46ec2cdf7a22: Download complete 2024-01-31T22:09:58.2161175Z 645106525941: Verifying Checksum 2024-01-31T22:09:58.2161643Z 645106525941: Download complete 2024-01-31T22:09:58.3345023Z eead4008daa0: Download complete 2024-01-31T22:09:58.4396077Z 086713bad12c: Download complete 2024-01-31T22:09:58.5193905Z a6f921a71379: Verifying Checksum 2024-01-31T22:09:58.5194446Z a6f921a71379: Download complete 2024-01-31T22:09:58.8355207Z 07da4866f0f7: Verifying Checksum 2024-01-31T22:09:58.8355720Z 07da4866f0f7: Download complete 2024-01-31T22:09:58.9071967Z 551d129979c4: Verifying Checksum 2024-01-31T22:09:58.9072585Z 551d129979c4: Download complete 2024-01-31T22:10:00.6069394Z 981cedff2c35: Verifying Checksum 2024-01-31T22:10:00.6069892Z 981cedff2c35: Download complete 2024-01-31T22:10:00.7125932Z af5120460811: Download complete 2024-01-31T22:10:20.7421563Z 7ed32bc8e469: Pull complete 2024-01-31T22:10:20.9559077Z ec1e7978c1fe: Pull complete 2024-01-31T22:10:21.1824228Z e6c6c62213b7: Pull complete 2024-01-31T22:10:26.2752815Z a72e1c3d7a64: Pull complete 2024-01-31T22:10:26.5122442Z e38e3f87c2f7: Pull complete 2024-01-31T22:10:26.7505707Z 4414a4cf055f: Pull complete 2024-01-31T22:10:26.9948718Z 3a9f3bab87ac: Pull complete 2024-01-31T22:10:28.8183008Z 8cb09e71074f: Pull complete 2024-01-31T22:10:29.0324182Z dc1cd3eaed4b: Pull complete 2024-01-31T22:10:29.2535051Z 709d1f5e9caf: Pull complete 2024-01-31T22:10:29.4996011Z 04bae84c1248: Pull complete 2024-01-31T22:10:37.1953209Z 94b1f01c8899: Verifying Checksum 2024-01-31T22:10:37.1953704Z 94b1f01c8899: Download complete 2024-01-31T22:10:37.2795211Z 2549feec2e55: Verifying Checksum 2024-01-31T22:10:37.2795893Z 2549feec2e55: Download complete 2024-01-31T22:10:37.3590311Z fb0771ee647b: Verifying Checksum 2024-01-31T22:10:37.3590783Z fb0771ee647b: Download complete 2024-01-31T22:10:37.5645609Z 4de4b2584f25: Verifying Checksum 2024-01-31T22:10:37.5646036Z 4de4b2584f25: Download complete 2024-01-31T22:10:37.6456512Z 357fa0717d4f: Download complete 2024-01-31T22:10:38.5092695Z 6f3a3cca3ba4: Verifying Checksum 2024-01-31T22:10:38.5093234Z 6f3a3cca3ba4: Download complete 2024-01-31T22:10:41.4553538Z 39c09b5c3871: Verifying Checksum 2024-01-31T22:10:41.4554107Z 39c09b5c3871: Download complete 2024-01-31T22:11:14.1586528Z 39c09b5c3871: Pull complete 2024-01-31T22:11:14.3623452Z e3a19dc71e24: Pull complete 2024-01-31T22:11:14.5909490Z 2c6f9f6f78ae: Pull complete 2024-01-31T22:11:14.8141763Z 28bc578be056: Pull complete 2024-01-31T22:11:15.0292213Z 10c7264f16e2: Pull complete 2024-01-31T22:11:15.2637530Z 2eb59b280f8b: Pull complete 2024-01-31T22:11:18.3849504Z 24162b9a6997: Pull complete 2024-01-31T22:11:18.5904304Z 7a669a2e9445: Pull complete 2024-01-31T22:11:18.8147353Z 18a6f0a160e7: Pull complete 2024-01-31T22:11:19.0836854Z 2a5d3623a8bd: Pull complete 2024-01-31T22:11:19.2942068Z 60cc705c0a7f: Pull complete 2024-01-31T22:11:19.5090371Z 502ded0a4cc2: Pull complete 2024-01-31T22:11:25.4484142Z a9584c8d8daa: Pull complete 2024-01-31T22:11:25.6602059Z d8b0c1606d41: Pull complete 2024-01-31T22:11:25.8850003Z a7cef78e9415: Pull complete 2024-01-31T22:11:26.7150215Z b53875681b5b: Pull complete 2024-01-31T22:11:26.9017162Z 563889f5b833: Pull complete 2024-01-31T22:11:27.1266166Z fec002e849bf: Pull complete 2024-01-31T22:11:27.5940875Z 0997871b0aa0: Pull complete 2024-01-31T22:11:27.8144895Z caa76df75bae: Pull complete 2024-01-31T22:11:28.2426508Z 3c343ac8ae12: Pull complete 2024-01-31T22:11:28.4591449Z b4fd78a94e6e: Pull complete 2024-01-31T22:11:28.7027258Z 4f4fb700ef54: Pull complete 2024-01-31T22:11:28.9243153Z 04a48576b446: Pull complete 2024-01-31T22:11:29.1603793Z 7b9381d5e5fb: Pull complete 2024-01-31T22:11:30.4554557Z e3c689050530: Pull complete 2024-01-31T22:11:30.6449600Z 9fba34398bbd: Pull complete 2024-01-31T22:11:30.8618696Z b4a66cb83da3: Pull complete 2024-01-31T22:11:31.3109144Z 50c25986af41: Pull complete 2024-01-31T22:11:31.5348454Z d33ac22afed4: Pull complete 2024-01-31T22:11:56.7305742Z 94b1f01c8899: Pull complete 2024-01-31T22:11:56.9507277Z 65f8aa6b7f17: Pull complete 2024-01-31T22:11:57.1537874Z d5e6634c4327: Pull complete 2024-01-31T22:11:57.4582828Z 46ec2cdf7a22: Pull complete 2024-01-31T22:11:57.5734453Z 645106525941: Pull complete 2024-01-31T22:11:57.7723573Z eead4008daa0: Pull complete 2024-01-31T22:11:57.9999582Z 086713bad12c: Pull complete 2024-01-31T22:11:58.2344078Z a6f921a71379: Pull complete 2024-01-31T22:11:58.8711385Z 07da4866f0f7: Pull complete 2024-01-31T22:11:59.0701704Z 551d129979c4: Pull complete 2024-01-31T22:12:00.7773013Z 981cedff2c35: Pull complete 2024-01-31T22:12:01.0032706Z af5120460811: Pull complete 2024-01-31T22:12:15.7159722Z 4de4b2584f25: Pull complete 2024-01-31T22:12:15.9387028Z 2549feec2e55: Pull complete 2024-01-31T22:12:16.1573584Z fb0771ee647b: Pull complete 2024-01-31T22:12:16.9755723Z 6f3a3cca3ba4: Pull complete 2024-01-31T22:12:17.2197590Z 357fa0717d4f: Pull complete 2024-01-31T22:12:18.0650202Z Digest: sha256:2b2cf3c3f6d85ed39aac322cdfac02285cd0a07904c8f9184f94a19e3a0e9d49 2024-01-31T22:12:18.1091846Z Status: Downloaded newer image for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:12:18.1313893Z 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:12:18.1459448Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2024-01-31T22:12:18.1459988Z with: 2024-01-31T22:12:18.1460272Z driver-version: 535.54.03 2024-01-31T22:12:18.1460613Z env: 2024-01-31T22:12:18.1460877Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:12:18.1461219Z ##[endgroup] 2024-01-31T22:12:18.1582080Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-01-31T22:12:18.1582649Z with: 2024-01-31T22:12:18.1582918Z timeout_minutes: 10 2024-01-31T22:12:18.1583243Z max_attempts: 3 2024-01-31T22:12:18.1613659Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y nvidia-docker2 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-docker2 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. Also check only the first GPU if there are more than one of them # so that the same driver version is not print over multiple lines INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing" elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing" else HAS_NVIDIA_DRIVER=1 echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation" fi set -e fi if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then # CAUTION: this may need to be updated in future if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then sudo yum groupinstall -y "Development Tools" # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" sudo modprobe backlight fi sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" set +e sudo /bin/bash /tmp/nvidia_driver -s --no-drm NVIDIA_INSTALLATION_STATUS=$? RESET_GPU=0 if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then sudo cat /var/log/nvidia-installer.log # Fail to install NVIDIA driver, try to reset the GPU RESET_GPU=1 elif [ -x "$(command -v nvidia-smi)" ]; then # Check again if nvidia-smi works even if the driver installation completes successfully INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! | # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" 2024-01-31T22:12:18.1644292Z retry_wait_seconds: 10 2024-01-31T22:12:18.1644656Z polling_interval_seconds: 1 2024-01-31T22:12:18.1645140Z warning_on_retry: true 2024-01-31T22:12:18.1645528Z continue_on_error: false 2024-01-31T22:12:18.1645866Z env: 2024-01-31T22:12:18.1646136Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:12:18.1646481Z DRIVER_VERSION: 535.54.03 2024-01-31T22:12:18.1646830Z ##[endgroup] 2024-01-31T22:12:18.2174163Z == Installing nvidia driver NVIDIA-Linux-x86_64-535.54.03.run == 2024-01-31T22:12:18.2175246Z + pre_install_nvidia_driver_amzn2 2024-01-31T22:12:18.2175952Z + sudo yum remove -y nvidia-driver-latest-dkms 2024-01-31T22:12:18.5132191Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-01-31T22:12:18.5513184Z No Match for argument: nvidia-driver-latest-dkms 2024-01-31T22:12:18.5752713Z No Packages marked for removal 2024-01-31T22:12:18.5871685Z + install_nvidia_driver_common 2024-01-31T22:12:18.5873863Z + echo 'Before installing NVIDIA driver' 2024-01-31T22:12:18.5874593Z + lspci 2024-01-31T22:12:18.5875007Z Before installing NVIDIA driver 2024-01-31T22:12:18.6071935Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2024-01-31T22:12:18.6072694Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2024-01-31T22:12:18.6073747Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2024-01-31T22:12:18.6074549Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2024-01-31T22:12:18.6075350Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061 2024-01-31T22:12:18.6076385Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2024-01-31T22:12:18.6077149Z 00:1e.0 3D controller: NVIDIA Corporation Device 2237 (rev a1) 2024-01-31T22:12:18.6077945Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2024-01-31T22:12:18.6078533Z + lsmod 2024-01-31T22:12:18.6086336Z Module Size Used by 2024-01-31T22:12:18.6086960Z ib_core 266240 0 2024-01-31T22:12:18.6087513Z nvidia_modeset 1306624 0 2024-01-31T22:12:18.6087956Z veth 16384 0 2024-01-31T22:12:18.6088339Z nvidia_uvm 1466368 0 2024-01-31T22:12:18.6088819Z nvidia 56422400 2 nvidia_uvm,nvidia_modeset 2024-01-31T22:12:18.6089309Z drm 421888 1 nvidia 2024-01-31T22:12:18.6089830Z i2c_core 77824 2 nvidia,drm 2024-01-31T22:12:18.6090411Z backlight 16384 1 nvidia_modeset 2024-01-31T22:12:18.6091104Z xt_conntrack 16384 1 2024-01-31T22:12:18.6091509Z ipt_MASQUERADE 16384 1 2024-01-31T22:12:18.6091932Z nf_nat_masquerade_ipv4 16384 1 ipt_MASQUERADE 2024-01-31T22:12:18.6092394Z nf_conntrack_netlink 49152 0 2024-01-31T22:12:18.6092847Z nfnetlink 16384 2 nf_conntrack_netlink 2024-01-31T22:12:18.6093300Z xfrm_user 45056 1 2024-01-31T22:12:18.6093711Z xfrm_algo 16384 1 xfrm_user 2024-01-31T22:12:18.6094182Z iptable_nat 16384 1 2024-01-31T22:12:18.6095469Z nf_conntrack_ipv4 16384 3 2024-01-31T22:12:18.6096130Z nf_defrag_ipv4 16384 1 nf_conntrack_ipv4 2024-01-31T22:12:18.6096774Z nf_nat_ipv4 16384 1 iptable_nat 2024-01-31T22:12:18.6097329Z nf_nat 32768 2 nf_nat_masquerade_ipv4,nf_nat_ipv4 2024-01-31T22:12:18.6098548Z nf_conntrack 155648 7 xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4,nf_nat,ipt_MASQUERADE,nf_nat_ipv4,nf_conntrack_netlink 2024-01-31T22:12:18.6099585Z xt_addrtype 16384 2 2024-01-31T22:12:18.6099976Z iptable_filter 16384 1 2024-01-31T22:12:18.6100388Z br_netfilter 24576 0 2024-01-31T22:12:18.6100941Z bridge 172032 1 br_netfilter 2024-01-31T22:12:18.6101416Z stp 16384 1 bridge 2024-01-31T22:12:18.6101840Z llc 16384 2 bridge,stp 2024-01-31T22:12:18.6102264Z overlay 86016 0 2024-01-31T22:12:18.6102648Z sunrpc 393216 1 2024-01-31T22:12:18.6103031Z dm_mirror 28672 0 2024-01-31T22:12:18.6103551Z dm_region_hash 20480 1 dm_mirror 2024-01-31T22:12:18.6104213Z dm_log 20480 2 dm_region_hash,dm_mirror 2024-01-31T22:12:18.6104868Z dm_mod 143360 2 dm_log,dm_mirror 2024-01-31T22:12:18.6105446Z dax 69632 1 dm_mod 2024-01-31T22:12:18.6105982Z crc32_pclmul 16384 0 2024-01-31T22:12:18.6106410Z ghash_clmulni_intel 16384 0 2024-01-31T22:12:18.6106822Z pcbc 16384 0 2024-01-31T22:12:18.6107240Z aesni_intel 188416 0 2024-01-31T22:12:18.6107786Z aes_x86_64 20480 1 aesni_intel 2024-01-31T22:12:18.6108383Z crypto_simd 16384 1 aesni_intel 2024-01-31T22:12:18.6108885Z mousedev 24576 0 2024-01-31T22:12:18.6109326Z glue_helper 16384 1 aesni_intel 2024-01-31T22:12:18.6109892Z cryptd 28672 3 crypto_simd,ghash_clmulni_intel,aesni_intel 2024-01-31T22:12:18.6110701Z psmouse 32768 0 2024-01-31T22:12:18.6111077Z evdev 20480 3 2024-01-31T22:12:18.6111441Z button 16384 0 2024-01-31T22:12:18.6111812Z ena 135168 0 2024-01-31T22:12:18.6112178Z crc32c_intel 24576 0 2024-01-31T22:12:18.6112553Z autofs4 49152 2 2024-01-31T22:12:18.6112927Z + modinfo nvidia 2024-01-31T22:12:18.6113597Z filename: /lib/modules/4.14.328-248.540.amzn2.x86_64/kernel/drivers/video/nvidia.ko 2024-01-31T22:12:18.6114293Z firmware: nvidia/535.54.03/gsp_tu10x.bin 2024-01-31T22:12:18.6114781Z firmware: nvidia/535.54.03/gsp_ga10x.bin 2024-01-31T22:12:18.6115274Z alias: char-major-195-* 2024-01-31T22:12:18.6115650Z version: 535.54.03 2024-01-31T22:12:18.6115996Z supported: external 2024-01-31T22:12:18.6116332Z license: NVIDIA 2024-01-31T22:12:18.6116685Z srcversion: EA9C7EF32617E104C8240C4 2024-01-31T22:12:18.6117164Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2024-01-31T22:12:18.6117665Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2024-01-31T22:12:18.6118151Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2024-01-31T22:12:18.6118637Z depends: i2c-core,drm 2024-01-31T22:12:18.6118995Z retpoline: Y 2024-01-31T22:12:18.6119289Z name: nvidia 2024-01-31T22:12:18.6119843Z vermagic: 4.14.328-248.540.amzn2.x86_64 SMP mod_unload modversions 2024-01-31T22:12:18.6120503Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2024-01-31T22:12:18.6121225Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2024-01-31T22:12:18.6121830Z parm: NVreg_ResmanDebugLevel:int 2024-01-31T22:12:18.6122287Z parm: NVreg_RmLogonRC:int 2024-01-31T22:12:18.6122716Z parm: NVreg_ModifyDeviceFiles:int 2024-01-31T22:12:18.6123176Z parm: NVreg_DeviceFileUID:int 2024-01-31T22:12:18.6123617Z parm: NVreg_DeviceFileGID:int 2024-01-31T22:12:18.6124059Z parm: NVreg_DeviceFileMode:int 2024-01-31T22:12:18.6124591Z parm: NVreg_InitializeSystemMemoryAllocations:int 2024-01-31T22:12:18.6125482Z parm: NVreg_UsePageAttributeTable:int 2024-01-31T22:12:18.6125963Z parm: NVreg_EnablePCIeGen3:int 2024-01-31T22:12:18.6126411Z parm: NVreg_EnableMSI:int 2024-01-31T22:12:18.6126832Z parm: NVreg_TCEBypassMode:int 2024-01-31T22:12:18.6127288Z parm: NVreg_EnableStreamMemOPs:int 2024-01-31T22:12:18.6127824Z parm: NVreg_RestrictProfilingToAdminUsers:int 2024-01-31T22:12:18.6128422Z parm: NVreg_PreserveVideoMemoryAllocations:int 2024-01-31T22:12:18.6128978Z parm: NVreg_EnableS0ixPowerManagement:int 2024-01-31T22:12:18.6129592Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2024-01-31T22:12:18.6130199Z parm: NVreg_DynamicPowerManagement:int 2024-01-31T22:12:18.6130813Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2024-01-31T22:12:18.6131432Z parm: NVreg_EnableGpuFirmware:int 2024-01-31T22:12:18.6131926Z parm: NVreg_EnableGpuFirmwareLogs:int 2024-01-31T22:12:18.6132461Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2024-01-31T22:12:18.6132996Z parm: NVreg_EnableUserNUMAManagement:int 2024-01-31T22:12:18.6133495Z parm: NVreg_MemoryPoolSize:int 2024-01-31T22:12:18.6133962Z parm: NVreg_KMallocHeapMaxSize:int 2024-01-31T22:12:18.6134443Z parm: NVreg_VMallocHeapMaxSize:int 2024-01-31T22:12:18.6134920Z parm: NVreg_IgnoreMMIOCheck:int 2024-01-31T22:12:18.6135369Z parm: NVreg_NvLinkDisable:int 2024-01-31T22:12:18.6135865Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2024-01-31T22:12:18.6136389Z parm: NVreg_RegisterPCIDriver:int 2024-01-31T22:12:18.6136864Z parm: NVreg_EnableResizableBar:int 2024-01-31T22:12:18.6137347Z parm: NVreg_EnableDbgBreakpoint:int 2024-01-31T22:12:18.6137955Z parm: NVreg_RegistryDwords:charp 2024-01-31T22:12:18.6138463Z parm: NVreg_RegistryDwordsPerDevice:charp 2024-01-31T22:12:18.6138937Z parm: NVreg_RmMsg:charp 2024-01-31T22:12:18.6139353Z parm: NVreg_GpuBlacklist:charp 2024-01-31T22:12:18.6139827Z parm: NVreg_TemporaryFilePath:charp 2024-01-31T22:12:18.6140292Z parm: NVreg_ExcludedGpus:charp 2024-01-31T22:12:18.6140752Z parm: NVreg_DmaRemapPeerMmio:int 2024-01-31T22:12:18.6141234Z parm: NVreg_RmNvlinkBandwidth:charp 2024-01-31T22:12:18.6141699Z parm: rm_firmware_active:charp 2024-01-31T22:12:18.6142117Z + HAS_NVIDIA_DRIVER=0 2024-01-31T22:12:18.6142496Z ++ command -v nvidia-smi 2024-01-31T22:12:18.6142882Z + '[' -x /usr/bin/nvidia-smi ']' 2024-01-31T22:12:18.6143263Z + set +e 2024-01-31T22:12:18.6143805Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2024-01-31T22:12:21.0228036Z + INSTALLED_DRIVER_VERSION=535.54.03 2024-01-31T22:12:21.0228712Z + NVIDIA_SMI_STATUS=0 2024-01-31T22:12:21.0229251Z + '[' 0 -ne 0 ']' 2024-01-31T22:12:21.0229637Z + '[' 535.54.03 '!=' 535.54.03 ']' 2024-01-31T22:12:21.0230035Z + HAS_NVIDIA_DRIVER=1 2024-01-31T22:12:21.0230799Z + echo 'NVIDIA driver (535.54.03) has already been installed. Skipping NVIDIA driver installation' 2024-01-31T22:12:21.0231561Z + set -e 2024-01-31T22:12:21.0231871Z + '[' 1 -eq 0 ']' 2024-01-31T22:12:21.0232470Z NVIDIA driver (535.54.03) has already been installed. Skipping NVIDIA driver installation 2024-01-31T22:12:21.0233208Z + post_install_nvidia_driver_common 2024-01-31T22:12:21.0233924Z + sudo modprobe nvidia 2024-01-31T22:12:21.0321111Z + echo 'After installing NVIDIA driver' 2024-01-31T22:12:21.0321572Z + lspci 2024-01-31T22:12:21.0321884Z After installing NVIDIA driver 2024-01-31T22:12:21.0404768Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2024-01-31T22:12:21.0405670Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2024-01-31T22:12:21.0406843Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2024-01-31T22:12:21.0407736Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2024-01-31T22:12:21.0408493Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061 2024-01-31T22:12:21.0409239Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2024-01-31T22:12:21.0409958Z 00:1e.0 3D controller: NVIDIA Corporation Device 2237 (rev a1) 2024-01-31T22:12:21.0410741Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2024-01-31T22:12:21.0411338Z + lsmod 2024-01-31T22:12:21.0418288Z Module Size Used by 2024-01-31T22:12:21.0418747Z ib_core 266240 0 2024-01-31T22:12:21.0419214Z nvidia_modeset 1306624 0 2024-01-31T22:12:21.0419655Z veth 16384 0 2024-01-31T22:12:21.0420113Z nvidia_uvm 1466368 0 2024-01-31T22:12:21.0420598Z nvidia 56422400 2 nvidia_uvm,nvidia_modeset 2024-01-31T22:12:21.0421093Z drm 421888 1 nvidia 2024-01-31T22:12:21.0421509Z i2c_core 77824 2 nvidia,drm 2024-01-31T22:12:21.0421972Z backlight 16384 1 nvidia_modeset 2024-01-31T22:12:21.0422413Z xt_conntrack 16384 1 2024-01-31T22:12:21.0422786Z ipt_MASQUERADE 16384 1 2024-01-31T22:12:21.0423223Z nf_nat_masquerade_ipv4 16384 1 ipt_MASQUERADE 2024-01-31T22:12:21.0423687Z nf_conntrack_netlink 49152 0 2024-01-31T22:12:21.0424124Z nfnetlink 16384 2 nf_conntrack_netlink 2024-01-31T22:12:21.0424589Z xfrm_user 45056 1 2024-01-31T22:12:21.0424980Z xfrm_algo 16384 1 xfrm_user 2024-01-31T22:12:21.0425423Z iptable_nat 16384 1 2024-01-31T22:12:21.0425902Z nf_conntrack_ipv4 16384 3 2024-01-31T22:12:21.0426339Z nf_defrag_ipv4 16384 1 nf_conntrack_ipv4 2024-01-31T22:12:21.0426881Z nf_nat_ipv4 16384 1 iptable_nat 2024-01-31T22:12:21.0427459Z nf_nat 32768 2 nf_nat_masquerade_ipv4,nf_nat_ipv4 2024-01-31T22:12:21.0428788Z nf_conntrack 155648 7 xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4,nf_nat,ipt_MASQUERADE,nf_nat_ipv4,nf_conntrack_netlink 2024-01-31T22:12:21.0429619Z xt_addrtype 16384 2 2024-01-31T22:12:21.0430003Z iptable_filter 16384 1 2024-01-31T22:12:21.0430384Z br_netfilter 24576 0 2024-01-31T22:12:21.0430788Z bridge 172032 1 br_netfilter 2024-01-31T22:12:21.0431228Z stp 16384 1 bridge 2024-01-31T22:12:21.0431648Z llc 16384 2 bridge,stp 2024-01-31T22:12:21.0432064Z overlay 86016 0 2024-01-31T22:12:21.0432432Z sunrpc 393216 1 2024-01-31T22:12:21.0432803Z dm_mirror 28672 0 2024-01-31T22:12:21.0433190Z dm_region_hash 20480 1 dm_mirror 2024-01-31T22:12:21.0433673Z dm_log 20480 2 dm_region_hash,dm_mirror 2024-01-31T22:12:21.0434181Z dm_mod 143360 2 dm_log,dm_mirror 2024-01-31T22:12:21.0434624Z dax 69632 1 dm_mod 2024-01-31T22:12:21.0435114Z crc32_pclmul 16384 0 2024-01-31T22:12:21.0435502Z ghash_clmulni_intel 16384 0 2024-01-31T22:12:21.0435913Z pcbc 16384 0 2024-01-31T22:12:21.0436287Z aesni_intel 188416 0 2024-01-31T22:12:21.0436685Z aes_x86_64 20480 1 aesni_intel 2024-01-31T22:12:21.0437191Z crypto_simd 16384 1 aesni_intel 2024-01-31T22:12:21.0437612Z mousedev 24576 0 2024-01-31T22:12:21.0438015Z glue_helper 16384 1 aesni_intel 2024-01-31T22:12:21.0438702Z cryptd 28672 3 crypto_simd,ghash_clmulni_intel,aesni_intel 2024-01-31T22:12:21.0439242Z psmouse 32768 0 2024-01-31T22:12:21.0439608Z evdev 20480 3 2024-01-31T22:12:21.0439975Z button 16384 0 2024-01-31T22:12:21.0440326Z ena 135168 0 2024-01-31T22:12:21.0440694Z crc32c_intel 24576 0 2024-01-31T22:12:21.0441073Z autofs4 49152 2 2024-01-31T22:12:21.0441424Z + modinfo nvidia 2024-01-31T22:12:21.0442070Z filename: /lib/modules/4.14.328-248.540.amzn2.x86_64/kernel/drivers/video/nvidia.ko 2024-01-31T22:12:21.0442761Z firmware: nvidia/535.54.03/gsp_tu10x.bin 2024-01-31T22:12:21.0443236Z firmware: nvidia/535.54.03/gsp_ga10x.bin 2024-01-31T22:12:21.0443726Z alias: char-major-195-* 2024-01-31T22:12:21.0444106Z version: 535.54.03 2024-01-31T22:12:21.0444443Z supported: external 2024-01-31T22:12:21.0444780Z license: NVIDIA 2024-01-31T22:12:21.0445419Z srcversion: EA9C7EF32617E104C8240C4 2024-01-31T22:12:21.0445886Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2024-01-31T22:12:21.0446389Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2024-01-31T22:12:21.0446887Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2024-01-31T22:12:21.0447370Z depends: i2c-core,drm 2024-01-31T22:12:21.0447731Z retpoline: Y 2024-01-31T22:12:21.0448050Z name: nvidia 2024-01-31T22:12:21.0448595Z vermagic: 4.14.328-248.540.amzn2.x86_64 SMP mod_unload modversions 2024-01-31T22:12:21.0449257Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2024-01-31T22:12:21.0449901Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2024-01-31T22:12:21.0450495Z parm: NVreg_ResmanDebugLevel:int 2024-01-31T22:12:21.0450933Z parm: NVreg_RmLogonRC:int 2024-01-31T22:12:21.0451368Z parm: NVreg_ModifyDeviceFiles:int 2024-01-31T22:12:21.0451825Z parm: NVreg_DeviceFileUID:int 2024-01-31T22:12:21.0452261Z parm: NVreg_DeviceFileGID:int 2024-01-31T22:12:21.0452705Z parm: NVreg_DeviceFileMode:int 2024-01-31T22:12:21.0453237Z parm: NVreg_InitializeSystemMemoryAllocations:int 2024-01-31T22:12:21.0453794Z parm: NVreg_UsePageAttributeTable:int 2024-01-31T22:12:21.0454279Z parm: NVreg_EnablePCIeGen3:int 2024-01-31T22:12:21.0454712Z parm: NVreg_EnableMSI:int 2024-01-31T22:12:21.0455213Z parm: NVreg_TCEBypassMode:int 2024-01-31T22:12:21.0455670Z parm: NVreg_EnableStreamMemOPs:int 2024-01-31T22:12:21.0456203Z parm: NVreg_RestrictProfilingToAdminUsers:int 2024-01-31T22:12:21.0456784Z parm: NVreg_PreserveVideoMemoryAllocations:int 2024-01-31T22:12:21.0457352Z parm: NVreg_EnableS0ixPowerManagement:int 2024-01-31T22:12:21.0457960Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2024-01-31T22:12:21.0458553Z parm: NVreg_DynamicPowerManagement:int 2024-01-31T22:12:21.0459172Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2024-01-31T22:12:21.0459768Z parm: NVreg_EnableGpuFirmware:int 2024-01-31T22:12:21.0460253Z parm: NVreg_EnableGpuFirmwareLogs:int 2024-01-31T22:12:21.0460791Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2024-01-31T22:12:21.0461346Z parm: NVreg_EnableUserNUMAManagement:int 2024-01-31T22:12:21.0461845Z parm: NVreg_MemoryPoolSize:int 2024-01-31T22:12:21.0462313Z parm: NVreg_KMallocHeapMaxSize:int 2024-01-31T22:12:21.0462796Z parm: NVreg_VMallocHeapMaxSize:int 2024-01-31T22:12:21.0463268Z parm: NVreg_IgnoreMMIOCheck:int 2024-01-31T22:12:21.0463720Z parm: NVreg_NvLinkDisable:int 2024-01-31T22:12:21.0464228Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2024-01-31T22:12:21.0464748Z parm: NVreg_RegisterPCIDriver:int 2024-01-31T22:12:21.0465235Z parm: NVreg_EnableResizableBar:int 2024-01-31T22:12:21.0465813Z parm: NVreg_EnableDbgBreakpoint:int 2024-01-31T22:12:21.0466294Z parm: NVreg_RegistryDwords:charp 2024-01-31T22:12:21.0466796Z parm: NVreg_RegistryDwordsPerDevice:charp 2024-01-31T22:12:21.0467279Z parm: NVreg_RmMsg:charp 2024-01-31T22:12:21.0467692Z parm: NVreg_GpuBlacklist:charp 2024-01-31T22:12:21.0468154Z parm: NVreg_TemporaryFilePath:charp 2024-01-31T22:12:21.0468632Z parm: NVreg_ExcludedGpus:charp 2024-01-31T22:12:21.0469092Z parm: NVreg_DmaRemapPeerMmio:int 2024-01-31T22:12:21.0469564Z parm: NVreg_RmNvlinkBandwidth:charp 2024-01-31T22:12:21.0470027Z parm: rm_firmware_active:charp 2024-01-31T22:12:21.0470431Z + set +e 2024-01-31T22:12:21.0470721Z + nvidia-smi 2024-01-31T22:12:23.1741161Z Wed Jan 31 22:12:23 2024 2024-01-31T22:12:23.1741936Z +---------------------------------------------------------------------------------------+ 2024-01-31T22:12:23.1742799Z | NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 | 2024-01-31T22:12:23.1743579Z |-----------------------------------------+----------------------+----------------------+ 2024-01-31T22:12:23.1744347Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2024-01-31T22:12:23.1745215Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2024-01-31T22:12:23.1745921Z | | | MIG M. | 2024-01-31T22:12:23.1746451Z |=========================================+======================+======================| 2024-01-31T22:12:23.1806049Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 | 2024-01-31T22:12:23.1806754Z | 0% 24C P0 52W / 300W | 4MiB / 23028MiB | 5% Default | 2024-01-31T22:12:23.1807382Z | | | N/A | 2024-01-31T22:12:23.1808241Z +-----------------------------------------+----------------------+----------------------+ 2024-01-31T22:12:23.1808975Z 2024-01-31T22:12:23.1809621Z +---------------------------------------------------------------------------------------+ 2024-01-31T22:12:23.1810327Z | Processes: | 2024-01-31T22:12:23.1811424Z | GPU GI CI PID Type Process name GPU Memory | 2024-01-31T22:12:23.1812100Z | ID ID Usage | 2024-01-31T22:12:23.1812657Z |=======================================================================================| 2024-01-31T22:12:23.1813286Z | No running processes found | 2024-01-31T22:12:23.1814031Z +---------------------------------------------------------------------------------------+ 2024-01-31T22:12:23.6745868Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2024-01-31T22:12:25.7966209Z NVIDIA A10G 2024-01-31T22:12:26.1032872Z + NVIDIA_SMI_STATUS=0 2024-01-31T22:12:26.1033578Z + '[' 0 -eq 0 ']' 2024-01-31T22:12:26.1034121Z + echo 'INFO: Ignoring allowed status 0' 2024-01-31T22:12:26.1034744Z + set -e 2024-01-31T22:12:26.1035107Z INFO: Ignoring allowed status 0 2024-01-31T22:12:26.1036935Z == Installing nvidia container toolkit for amzn2 == 2024-01-31T22:12:26.1038665Z + sudo yum install -y yum-utils 2024-01-31T22:12:26.4009063Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-01-31T22:12:27.8524735Z Package yum-utils-1.1.31-46.amzn2.0.1.noarch already installed and latest version 2024-01-31T22:12:27.8525609Z Nothing to do 2024-01-31T22:12:28.0001676Z + sudo yum-config-manager --add-repo https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo 2024-01-31T22:12:28.3167218Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-01-31T22:12:28.3436792Z adding repo from: https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo 2024-01-31T22:12:28.3438888Z grabbing file https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo to /etc/yum.repos.d/nvidia-docker.repo 2024-01-31T22:12:28.3439835Z repo saved to /etc/yum.repos.d/nvidia-docker.repo 2024-01-31T22:12:28.3548409Z + sudo yum install -y nvidia-docker2 2024-01-31T22:12:28.6463255Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-01-31T22:12:30.1159501Z Package nvidia-docker2-2.13.0-1.noarch already installed and latest version 2024-01-31T22:12:30.1160151Z Nothing to do 2024-01-31T22:12:30.2651704Z + sudo systemctl restart docker 2024-01-31T22:13:20.2798749Z Command completed after 1 attempt(s). 2024-01-31T22:13:20.2857385Z ##[group]Run python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84 2024-01-31T22:13:20.2858150Z python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84 2024-01-31T22:13:20.2858847Z python3 -m tools.stats.monitor > usage_log.txt 2>&1 & 2024-01-31T22:13:20.2859516Z echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" 2024-01-31T22:13:20.2867775Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:20.2868279Z env: 2024-01-31T22:13:20.2868556Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:20.2869017Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:20.2869526Z ##[endgroup] 2024-01-31T22:13:20.5229866Z Defaulting to user installation because normal site-packages is not writeable 2024-01-31T22:13:20.5379579Z Requirement already satisfied: psutil==5.9.1 in /home/ec2-user/.local/lib/python3.7/site-packages (5.9.1) 2024-01-31T22:13:20.5476196Z Requirement already satisfied: nvidia-ml-py==11.525.84 in /home/ec2-user/.local/lib/python3.7/site-packages (11.525.84) 2024-01-31T22:13:20.6365177Z Prepare all required actions 2024-01-31T22:13:20.6366061Z Getting action download info 2024-01-31T22:13:20.7859280Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2024-01-31T22:13:21.0133309Z Download action repository 'actions/download-artifact@v3' (SHA:9bc31d5ccc31df68ecc42ccf4149144866c47d8a) 2024-01-31T22:13:21.1494375Z ##[group]Run ./.github/actions/download-build-artifacts 2024-01-31T22:13:21.1494857Z with: 2024-01-31T22:13:21.1495224Z name: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-01-31T22:13:21.1495857Z env: 2024-01-31T22:13:21.1496140Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:21.1496613Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:21.1497097Z ##[endgroup] 2024-01-31T22:13:21.1530049Z ##[group]Run seemethere/download-artifact-s3@v4 2024-01-31T22:13:21.1530486Z with: 2024-01-31T22:13:21.1530850Z name: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-01-31T22:13:21.1531388Z s3-bucket: gha-artifacts 2024-01-31T22:13:21.1531730Z region: us-east-1 2024-01-31T22:13:21.1532025Z env: 2024-01-31T22:13:21.1532297Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:21.1532735Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:21.1533209Z ##[endgroup] 2024-01-31T22:13:21.5747051Z (node:18775) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2024-01-31T22:13:21.5747761Z 2024-01-31T22:13:21.5748028Z Please migrate your code to use AWS SDK for JavaScript (v3). 2024-01-31T22:13:21.5748783Z For more information, check the migration guide at https://a.co/7PzMCcy 2024-01-31T22:13:21.5749643Z (Use `node --trace-warnings ...` to show where the warning was created) 2024-01-31T22:13:21.6654139Z Found 1 objects with prefix pytorch/pytorch/7731552918/linux-focal-cuda12.1-py3-gcc9-slow-gradcheck/ 2024-01-31T22:13:21.6655242Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2024-01-31T22:13:43.4817285Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2024-01-31T22:13:43.4823356Z Artifact download has finished successfully 2024-01-31T22:13:43.4984630Z ##[group]Run unzip -o artifacts.zip 2024-01-31T22:13:43.4985073Z unzip -o artifacts.zip 2024-01-31T22:13:43.4992955Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:43.4993445Z env: 2024-01-31T22:13:43.4993711Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:43.4994159Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:43.4994649Z ##[endgroup] 2024-01-31T22:13:43.5022246Z Archive: artifacts.zip 2024-01-31T22:13:43.5023770Z creating: dist/ 2024-01-31T22:13:45.4107850Z inflating: dist/torch-2.3.0a0+git86d2ab3-cp310-cp310-linux_x86_64.whl 2024-01-31T22:13:45.4108540Z creating: build/custom_test_artifacts/ 2024-01-31T22:13:45.4109139Z creating: build/custom_test_artifacts/custom-op-build/ 2024-01-31T22:13:45.4109874Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2024-01-31T22:13:45.4110716Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2024-01-31T22:13:45.4115714Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2024-01-31T22:13:45.4116771Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/ 2024-01-31T22:13:45.4117723Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeSystem.cmake 2024-01-31T22:13:45.4118939Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/ 2024-01-31T22:13:45.4120278Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/tmp/ 2024-01-31T22:13:45.4121398Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c 2024-01-31T22:13:45.4122676Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/a.out 2024-01-31T22:13:45.4123958Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/ 2024-01-31T22:13:45.4125348Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/tmp/ 2024-01-31T22:13:45.4126642Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp 2024-01-31T22:13:45.4128037Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/a.out 2024-01-31T22:13:45.4129399Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin 2024-01-31T22:13:45.4130789Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeCCompiler.cmake 2024-01-31T22:13:45.4131952Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin 2024-01-31T22:13:45.4133133Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeCXXCompiler.cmake 2024-01-31T22:13:45.4134170Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/ 2024-01-31T22:13:45.4135176Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/ 2024-01-31T22:13:45.4167771Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2024-01-31T22:13:45.4205727Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2024-01-31T22:13:45.4207198Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2024-01-31T22:13:45.4250794Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2024-01-31T22:13:45.4252705Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2024-01-31T22:13:45.4254586Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2024-01-31T22:13:45.4256520Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2024-01-31T22:13:45.4258163Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2024-01-31T22:13:45.4259557Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2024-01-31T22:13:45.4261026Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2024-01-31T22:13:45.4262416Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2024-01-31T22:13:45.4263772Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2024-01-31T22:13:45.4265050Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2024-01-31T22:13:45.4266301Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.reg.c 2024-01-31T22:13:45.4267523Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin 2024-01-31T22:13:45.4268745Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2024-01-31T22:13:45.4269945Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.o 2024-01-31T22:13:45.4271163Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/CMakeCUDACompilerId.cu 2024-01-31T22:13:45.4317173Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/a.out 2024-01-31T22:13:45.4375100Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CUDA.bin 2024-01-31T22:13:45.4376702Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeCUDACompiler.cmake 2024-01-31T22:13:45.4378002Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2024-01-31T22:13:45.4379061Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2024-01-31T22:13:45.4380188Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2024-01-31T22:13:45.4381400Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2024-01-31T22:13:45.4382428Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2024-01-31T22:13:45.4383577Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2024-01-31T22:13:45.4384683Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2024-01-31T22:13:45.4385724Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2024-01-31T22:13:45.4386803Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2024-01-31T22:13:45.4387872Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2024-01-31T22:13:45.4388952Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2024-01-31T22:13:45.4390089Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2024-01-31T22:13:45.4391149Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2024-01-31T22:13:45.4400528Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2024-01-31T22:13:45.4501388Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2024-01-31T22:13:45.4503075Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2024-01-31T22:13:45.4504634Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2024-01-31T22:13:45.4506085Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2024-01-31T22:13:45.4507235Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2024-01-31T22:13:45.4508336Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2024-01-31T22:13:45.4509445Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2024-01-31T22:13:45.4510578Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2024-01-31T22:13:45.4511711Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2024-01-31T22:13:45.4512832Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2024-01-31T22:13:45.4513927Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 2024-01-31T22:13:45.4525305Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2024-01-31T22:13:45.4604553Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2024-01-31T22:13:45.4606160Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2024-01-31T22:13:45.4607453Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2024-01-31T22:13:45.4608549Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2024-01-31T22:13:45.4609591Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2024-01-31T22:13:45.4610875Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2024-01-31T22:13:45.4611780Z inflating: build/custom_test_artifacts/custom-op-build/detect_cuda_version.cc 2024-01-31T22:13:45.4612626Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2024-01-31T22:13:45.4613408Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2024-01-31T22:13:45.4614202Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2024-01-31T22:13:45.4695237Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2024-01-31T22:13:45.4756256Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2024-01-31T22:13:45.4757721Z creating: build/custom_test_artifacts/jit-hook-build/ 2024-01-31T22:13:45.4759106Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2024-01-31T22:13:45.4760279Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2024-01-31T22:13:45.4762861Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2024-01-31T22:13:45.4763900Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/ 2024-01-31T22:13:45.4765071Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeSystem.cmake 2024-01-31T22:13:45.4766319Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/ 2024-01-31T22:13:45.4767612Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/tmp/ 2024-01-31T22:13:45.4768705Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c 2024-01-31T22:13:45.4770080Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/a.out 2024-01-31T22:13:45.4771224Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/ 2024-01-31T22:13:45.4772249Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/tmp/ 2024-01-31T22:13:45.4773562Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp 2024-01-31T22:13:45.4774881Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/a.out 2024-01-31T22:13:45.4776226Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin 2024-01-31T22:13:45.4777370Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeCCompiler.cmake 2024-01-31T22:13:45.4778531Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin 2024-01-31T22:13:45.4779682Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeCXXCompiler.cmake 2024-01-31T22:13:45.4780756Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/ 2024-01-31T22:13:45.4781756Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/ 2024-01-31T22:13:45.4815225Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2024-01-31T22:13:45.4853378Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2024-01-31T22:13:45.4854797Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2024-01-31T22:13:45.4897587Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2024-01-31T22:13:45.4899474Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2024-01-31T22:13:45.4901424Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2024-01-31T22:13:45.4903360Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2024-01-31T22:13:45.4905024Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2024-01-31T22:13:45.4906417Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2024-01-31T22:13:45.4907913Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2024-01-31T22:13:45.4909273Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2024-01-31T22:13:45.4910659Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2024-01-31T22:13:45.4911935Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2024-01-31T22:13:45.4913152Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.reg.c 2024-01-31T22:13:45.4914348Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin 2024-01-31T22:13:45.4915559Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2024-01-31T22:13:45.4916742Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.o 2024-01-31T22:13:45.4917938Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/CMakeCUDACompilerId.cu 2024-01-31T22:13:45.4964750Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/a.out 2024-01-31T22:13:45.5021869Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CUDA.bin 2024-01-31T22:13:45.5023534Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeCUDACompiler.cmake 2024-01-31T22:13:45.5024745Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2024-01-31T22:13:45.5025794Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2024-01-31T22:13:45.5027018Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2024-01-31T22:13:45.5028113Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2024-01-31T22:13:45.5029180Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2024-01-31T22:13:45.5030412Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2024-01-31T22:13:45.5031556Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2024-01-31T22:13:45.5032612Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2024-01-31T22:13:45.5033715Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2024-01-31T22:13:45.5034815Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2024-01-31T22:13:45.5035918Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2024-01-31T22:13:45.5037016Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2024-01-31T22:13:45.5038107Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2024-01-31T22:13:45.5047467Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2024-01-31T22:13:45.5108303Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2024-01-31T22:13:45.5110201Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2024-01-31T22:13:45.5112676Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2024-01-31T22:13:45.5114839Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2024-01-31T22:13:45.5116607Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2024-01-31T22:13:45.5118367Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2024-01-31T22:13:45.5120127Z inflating: build/custom_test_artifacts/jit-hook-build/detect_cuda_version.cc 2024-01-31T22:13:45.5120956Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 2024-01-31T22:13:45.5121710Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2024-01-31T22:13:45.5122505Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2024-01-31T22:13:45.5161240Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2024-01-31T22:13:45.5162019Z creating: build/custom_test_artifacts/custom-backend-build/ 2024-01-31T22:13:45.5162774Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 2024-01-31T22:13:45.5163647Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2024-01-31T22:13:45.5168969Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2024-01-31T22:13:45.5170232Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/ 2024-01-31T22:13:45.5171221Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeSystem.cmake 2024-01-31T22:13:45.5172421Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/ 2024-01-31T22:13:45.5173455Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/tmp/ 2024-01-31T22:13:45.5174615Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c 2024-01-31T22:13:45.5175781Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/a.out 2024-01-31T22:13:45.5176960Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/ 2024-01-31T22:13:45.5178005Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/tmp/ 2024-01-31T22:13:45.5179217Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp 2024-01-31T22:13:45.5180559Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/a.out 2024-01-31T22:13:45.5181809Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin 2024-01-31T22:13:45.5183077Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeCCompiler.cmake 2024-01-31T22:13:45.5184297Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin 2024-01-31T22:13:45.5185522Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeCXXCompiler.cmake 2024-01-31T22:13:45.5186620Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/ 2024-01-31T22:13:45.5187679Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/ 2024-01-31T22:13:45.5223065Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2024-01-31T22:13:45.5261342Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2024-01-31T22:13:45.5262989Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2024-01-31T22:13:45.5305936Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2024-01-31T22:13:45.5307429Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2024-01-31T22:13:45.5309043Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2024-01-31T22:13:45.5310735Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2024-01-31T22:13:45.5312280Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2024-01-31T22:13:45.5313736Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2024-01-31T22:13:45.5315183Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2024-01-31T22:13:45.5316620Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2024-01-31T22:13:45.5318022Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2024-01-31T22:13:45.5319363Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2024-01-31T22:13:45.5320715Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.reg.c 2024-01-31T22:13:45.5321986Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin 2024-01-31T22:13:45.5323265Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2024-01-31T22:13:45.5324506Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.o 2024-01-31T22:13:45.5326012Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/CMakeCUDACompilerId.cu 2024-01-31T22:13:45.5376234Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/a.out 2024-01-31T22:13:45.5434306Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CUDA.bin 2024-01-31T22:13:45.5435564Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeCUDACompiler.cmake 2024-01-31T22:13:45.5436607Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2024-01-31T22:13:45.5437513Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2024-01-31T22:13:45.5438472Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2024-01-31T22:13:45.5439540Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 2024-01-31T22:13:45.5440728Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 2024-01-31T22:13:45.5441969Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2024-01-31T22:13:45.5443178Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2024-01-31T22:13:45.5444318Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2024-01-31T22:13:45.5445635Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2024-01-31T22:13:45.5446935Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2024-01-31T22:13:45.5448124Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2024-01-31T22:13:45.5449309Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2024-01-31T22:13:45.5450562Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2024-01-31T22:13:45.5451797Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2024-01-31T22:13:45.5565572Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2024-01-31T22:13:45.5567285Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2024-01-31T22:13:45.5569016Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2024-01-31T22:13:45.5571050Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2024-01-31T22:13:45.5572868Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2024-01-31T22:13:45.5574555Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2024-01-31T22:13:45.5576304Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2024-01-31T22:13:45.5578061Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2024-01-31T22:13:45.5579808Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2024-01-31T22:13:45.5581565Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2024-01-31T22:13:45.5583263Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2024-01-31T22:13:45.5593012Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2024-01-31T22:13:45.5645881Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2024-01-31T22:13:45.5647786Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2024-01-31T22:13:45.5649450Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2024-01-31T22:13:45.5651004Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2024-01-31T22:13:45.5652358Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2024-01-31T22:13:45.5653765Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2024-01-31T22:13:45.5655194Z inflating: build/custom_test_artifacts/custom-backend-build/detect_cuda_version.cc 2024-01-31T22:13:45.5656486Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2024-01-31T22:13:45.5657646Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2024-01-31T22:13:45.5658827Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2024-01-31T22:13:45.5752942Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2024-01-31T22:13:45.5794138Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2024-01-31T22:13:45.5795039Z creating: build/lib/ 2024-01-31T22:13:45.5880417Z inflating: build/lib/libprotobuf-lite.a 2024-01-31T22:13:45.6319661Z inflating: build/lib/libprotobuf.a 2024-01-31T22:13:45.6329321Z inflating: build/lib/libpthreadpool.a 2024-01-31T22:13:45.6337498Z inflating: build/lib/libcpuinfo.a 2024-01-31T22:13:45.6345275Z inflating: build/lib/libcpuinfo_internals.a 2024-01-31T22:13:45.6345987Z inflating: build/lib/libclog.a 2024-01-31T22:13:45.6363867Z inflating: build/lib/libnnpack.a 2024-01-31T22:13:45.6366327Z inflating: build/lib/libnnpack_reference_layers.a 2024-01-31T22:13:45.6429154Z inflating: build/lib/libgtest.a 2024-01-31T22:13:45.6501563Z inflating: build/lib/libbenchmark.a 2024-01-31T22:13:45.6563602Z inflating: build/lib/libasmjit.a 2024-01-31T22:13:45.6571296Z inflating: build/lib/libittnotify.a 2024-01-31T22:13:45.6598274Z inflating: build/lib/libtensorpipe_uv.a 2024-01-31T22:13:45.6723060Z inflating: build/lib/libgloo.a 2024-01-31T22:13:45.6744667Z inflating: build/lib/libfmt.a 2024-01-31T22:13:45.6830266Z inflating: build/lib/libc10.so 2024-01-31T22:13:45.6832162Z inflating: build/lib/libcaffe2_nvrtc.so 2024-01-31T22:13:45.6832831Z inflating: build/lib/libfoxi_loader.a 2024-01-31T22:13:45.6834115Z inflating: build/lib/libtorch_global_deps.so 2024-01-31T22:13:45.6847768Z inflating: build/lib/libqnnpack.a 2024-01-31T22:13:45.7333384Z inflating: build/lib/libprotoc.a 2024-01-31T22:13:45.7352198Z inflating: build/lib/libpytorch_qnnpack.a 2024-01-31T22:13:45.7369991Z inflating: build/lib/libgmock.a 2024-01-31T22:13:45.7370629Z inflating: build/lib/libgtest_main.a 2024-01-31T22:13:45.7371497Z inflating: build/lib/libbenchmark_main.a 2024-01-31T22:13:45.7741354Z inflating: build/lib/libgloo_cuda.a 2024-01-31T22:13:46.6854522Z inflating: build/lib/libdnnl.a 2024-01-31T22:13:46.7406681Z inflating: build/lib/libtensorpipe.a 2024-01-31T22:13:46.7457560Z inflating: build/lib/libc10_cuda.so 2024-01-31T22:13:46.7458018Z inflating: build/lib/libgmock_main.a 2024-01-31T22:13:46.8663228Z inflating: build/lib/libfbgemm.a 2024-01-31T22:13:46.8908397Z inflating: build/lib/libtensorpipe_cuda.a 2024-01-31T22:13:46.9401997Z inflating: build/lib/libkineto.a 2024-01-31T22:13:46.9440030Z inflating: build/lib/libcaffe2_protos.a 2024-01-31T22:13:46.9479199Z inflating: build/lib/libonnx_proto.a 2024-01-31T22:13:46.9650214Z inflating: build/lib/libXNNPACK.a 2024-01-31T22:13:47.0301917Z inflating: build/lib/libonnx.a 2024-01-31T22:13:49.2665860Z inflating: build/lib/libtorch_cpu.so 2024-01-31T22:13:49.2669614Z inflating: build/lib/libunbox_lib.a 2024-01-31T22:13:51.1555441Z inflating: build/lib/libtorch_cuda.so 2024-01-31T22:13:51.1556427Z inflating: build/lib/libtorch.so 2024-01-31T22:13:51.1557560Z inflating: build/lib/libc10d_cuda_test.so 2024-01-31T22:13:51.9512397Z inflating: build/lib/libtorch_cuda_linalg.so 2024-01-31T22:13:51.9516439Z inflating: build/lib/libshm.so 2024-01-31T22:13:51.9568330Z inflating: build/lib/libtorchbind_test.so 2024-01-31T22:13:51.9587926Z inflating: build/lib/libjitbackend_test.so 2024-01-31T22:13:51.9612002Z inflating: build/lib/libbackend_with_compiler.so 2024-01-31T22:13:52.1361902Z inflating: build/lib/libtorch_python.so 2024-01-31T22:13:52.1395287Z inflating: build/lib/libnnapi_backend.so 2024-01-31T22:13:52.1395742Z creating: build/bin/ 2024-01-31T22:13:52.1444263Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2024-01-31T22:13:52.1493935Z inflating: build/bin/c10_DeviceGuard_test 2024-01-31T22:13:52.1542457Z inflating: build/bin/c10_Device_test 2024-01-31T22:13:52.1600371Z inflating: build/bin/c10_DispatchKeySet_test 2024-01-31T22:13:52.1652040Z inflating: build/bin/c10_Scalar_test 2024-01-31T22:13:52.1699788Z inflating: build/bin/c10_StreamGuard_test 2024-01-31T22:13:52.1749166Z inflating: build/bin/c10_SymInt_test 2024-01-31T22:13:52.1802193Z inflating: build/bin/c10_InlineDeviceGuard_test 2024-01-31T22:13:52.1855803Z inflating: build/bin/c10_InlineStreamGuard_test 2024-01-31T22:13:52.1910387Z inflating: build/bin/c10_SizesAndStrides_test 2024-01-31T22:13:52.1979220Z inflating: build/bin/c10_cow_test 2024-01-31T22:13:52.2030852Z inflating: build/bin/c10_Bitset_test 2024-01-31T22:13:52.2079339Z inflating: build/bin/c10_ConstexprCrc_test 2024-01-31T22:13:52.2126691Z inflating: build/bin/c10_DeadlockDetection_test 2024-01-31T22:13:52.2174733Z inflating: build/bin/c10_Half_test 2024-01-31T22:13:52.2229085Z inflating: build/bin/c10_LeftRight_test 2024-01-31T22:13:52.2282571Z inflating: build/bin/c10_Metaprogramming_test 2024-01-31T22:13:52.2331048Z inflating: build/bin/c10_Synchronized_test 2024-01-31T22:13:52.2384816Z inflating: build/bin/c10_ThreadLocal_test 2024-01-31T22:13:52.2434881Z inflating: build/bin/c10_TypeIndex_test 2024-01-31T22:13:52.2485370Z inflating: build/bin/c10_TypeList_test 2024-01-31T22:13:52.2532220Z inflating: build/bin/c10_TypeTraits_test 2024-01-31T22:13:52.2582035Z inflating: build/bin/c10_accumulate_test 2024-01-31T22:13:52.2635709Z inflating: build/bin/c10_bfloat16_test 2024-01-31T22:13:52.2684584Z inflating: build/bin/c10_bit_cast_test 2024-01-31T22:13:52.2739404Z inflating: build/bin/c10_complex_math_test 2024-01-31T22:13:52.2793868Z inflating: build/bin/c10_complex_test 2024-01-31T22:13:52.2844255Z inflating: build/bin/c10_exception_test 2024-01-31T22:13:52.2893220Z inflating: build/bin/c10_flags_test 2024-01-31T22:13:52.2941730Z inflating: build/bin/c10_generic_math_test 2024-01-31T22:13:52.3100767Z inflating: build/bin/c10_intrusive_ptr_test 2024-01-31T22:13:52.3149818Z inflating: build/bin/c10_irange_test 2024-01-31T22:13:52.3203832Z inflating: build/bin/c10_logging_test 2024-01-31T22:13:52.3277054Z inflating: build/bin/c10_optional_test 2024-01-31T22:13:52.3336880Z inflating: build/bin/c10_ordered_preserving_dict_test 2024-01-31T22:13:52.3389014Z inflating: build/bin/c10_registry_test 2024-01-31T22:13:52.3439051Z inflating: build/bin/c10_ssize_test 2024-01-31T22:13:52.3584941Z inflating: build/bin/c10_small_vector_test 2024-01-31T22:13:52.3634349Z inflating: build/bin/c10_tempfile_test 2024-01-31T22:13:52.3690706Z inflating: build/bin/c10_string_view_test 2024-01-31T22:13:52.3737685Z inflating: build/bin/c10_intrusive_ptr_benchmark 2024-01-31T22:13:52.3792072Z inflating: build/bin/c10_typeid_test 2024-01-31T22:13:52.4225216Z inflating: build/bin/protoc-3.13.0.0 2024-01-31T22:13:52.4658355Z inflating: build/bin/protoc 2024-01-31T22:13:52.4709666Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_1_var_test 2024-01-31T22:13:52.4761651Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_stream 2024-01-31T22:13:52.4812334Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2024-01-31T22:13:52.4862749Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_from_2_processes 2024-01-31T22:13:52.4914925Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2024-01-31T22:13:52.4965362Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2024-01-31T22:13:52.5012789Z inflating: build/bin/c10_cuda_CUDATest 2024-01-31T22:13:52.5063893Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2024-01-31T22:13:52.5353829Z inflating: build/bin/vec_test_all_types_DEFAULT 2024-01-31T22:13:52.5658253Z inflating: build/bin/vec_test_all_types_AVX512 2024-01-31T22:13:52.5977856Z inflating: build/bin/vec_test_all_types_AVX2 2024-01-31T22:13:52.6029352Z inflating: build/bin/FileStoreTest 2024-01-31T22:13:52.6083342Z inflating: build/bin/TCPStoreTest 2024-01-31T22:13:52.6135201Z inflating: build/bin/HashStoreTest 2024-01-31T22:13:52.6148059Z inflating: build/bin/ProcessGroupMPITest 2024-01-31T22:13:52.6200880Z inflating: build/bin/test_edge_op_registration 2024-01-31T22:13:52.6202519Z inflating: build/bin/example_allreduce 2024-01-31T22:13:52.6256221Z inflating: build/bin/test_dist_autograd 2024-01-31T22:13:52.6323851Z inflating: build/bin/test_cpp_rpc 2024-01-31T22:13:52.6326094Z inflating: build/bin/parallel_benchmark 2024-01-31T22:13:52.6391144Z inflating: build/bin/test_mobile_nnc 2024-01-31T22:13:52.6399974Z inflating: build/bin/aot_model_compiler_test 2024-01-31T22:13:52.6732836Z inflating: build/bin/test_lazy 2024-01-31T22:13:52.6804005Z inflating: build/bin/Dict_test 2024-01-31T22:13:52.6855212Z inflating: build/bin/Dimname_test 2024-01-31T22:13:52.6917864Z inflating: build/bin/MaybeOwned_test 2024-01-31T22:13:52.6973073Z inflating: build/bin/NamedTensor_test 2024-01-31T22:13:52.7030303Z inflating: build/bin/apply_utils_test 2024-01-31T22:13:52.7087398Z inflating: build/bin/atest 2024-01-31T22:13:52.7148936Z inflating: build/bin/basic 2024-01-31T22:13:52.8285673Z inflating: build/bin/test_api 2024-01-31T22:13:52.8339005Z inflating: build/bin/broadcast_test 2024-01-31T22:13:52.8388126Z inflating: build/bin/cpu_allocator_test 2024-01-31T22:13:52.8444276Z inflating: build/bin/cpu_generator_test 2024-01-31T22:13:52.8496001Z inflating: build/bin/cpu_profiling_allocator_test 2024-01-31T22:13:52.8585948Z inflating: build/bin/cpu_rng_test 2024-01-31T22:13:52.8635320Z inflating: build/bin/dispatch_key_set_test 2024-01-31T22:13:52.8683655Z inflating: build/bin/dlconvertor_test 2024-01-31T22:13:52.8739633Z inflating: build/bin/extension_backend_test 2024-01-31T22:13:52.8793415Z inflating: build/bin/half_test 2024-01-31T22:13:52.8885856Z inflating: build/bin/ivalue_test 2024-01-31T22:13:52.8933448Z inflating: build/bin/lazy_tensor_test 2024-01-31T22:13:52.8986096Z inflating: build/bin/math_kernel_test 2024-01-31T22:13:52.9039185Z inflating: build/bin/memory_format_test 2024-01-31T22:13:52.9090589Z inflating: build/bin/memory_overlapping_test 2024-01-31T22:13:52.9141443Z inflating: build/bin/mobile_memory_cleanup 2024-01-31T22:13:52.9191012Z inflating: build/bin/operator_name_test 2024-01-31T22:13:52.9245640Z inflating: build/bin/native_test 2024-01-31T22:13:52.9295318Z inflating: build/bin/operators_test 2024-01-31T22:13:52.9345465Z inflating: build/bin/packedtensoraccessor_test 2024-01-31T22:13:52.9410657Z inflating: build/bin/pow_test 2024-01-31T22:13:52.9465984Z inflating: build/bin/quantized_test 2024-01-31T22:13:52.9514675Z inflating: build/bin/reduce_ops_test 2024-01-31T22:13:52.9564135Z inflating: build/bin/reportMemoryUsage_test 2024-01-31T22:13:52.9618378Z inflating: build/bin/scalar_tensor_test 2024-01-31T22:13:52.9675672Z inflating: build/bin/scalar_test 2024-01-31T22:13:52.9725137Z inflating: build/bin/StorageUtils_test 2024-01-31T22:13:52.9776230Z inflating: build/bin/stride_properties_test 2024-01-31T22:13:52.9852616Z inflating: build/bin/tensor_iterator_test 2024-01-31T22:13:52.9904984Z inflating: build/bin/test_parallel 2024-01-31T22:13:52.9907556Z inflating: build/bin/thread_init_test 2024-01-31T22:13:52.9962191Z inflating: build/bin/type_ptr_test 2024-01-31T22:13:53.0020040Z inflating: build/bin/type_test 2024-01-31T22:13:53.0070667Z inflating: build/bin/undefined_tensor_test 2024-01-31T22:13:53.0071723Z inflating: build/bin/verify_api_visibility 2024-01-31T22:13:53.0139509Z inflating: build/bin/legacy_vmap_test 2024-01-31T22:13:53.0189429Z inflating: build/bin/weakref_test 2024-01-31T22:13:53.0239818Z inflating: build/bin/wrapdim_test 2024-01-31T22:13:53.0289606Z inflating: build/bin/xla_tensor_test 2024-01-31T22:13:53.0392319Z inflating: build/bin/List_test 2024-01-31T22:13:53.0450980Z inflating: build/bin/IListRef_test 2024-01-31T22:13:53.0515466Z inflating: build/bin/KernelFunction_test 2024-01-31T22:13:53.0633498Z inflating: build/bin/kernel_function_legacy_test 2024-01-31T22:13:53.0726738Z inflating: build/bin/kernel_function_test 2024-01-31T22:13:53.0849945Z inflating: build/bin/kernel_lambda_legacy_test 2024-01-31T22:13:53.0950388Z inflating: build/bin/kernel_lambda_test 2024-01-31T22:13:53.1009721Z inflating: build/bin/kernel_stackbased_test 2024-01-31T22:13:53.1103208Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2024-01-31T22:13:53.1152226Z inflating: build/bin/CppSignature_test 2024-01-31T22:13:53.1206458Z inflating: build/bin/backend_fallback_test 2024-01-31T22:13:53.1254049Z inflating: build/bin/op_allowlist_test 2024-01-31T22:13:53.1544982Z inflating: build/bin/op_registration_test 2024-01-31T22:13:53.1606829Z inflating: build/bin/inline_container_test 2024-01-31T22:13:53.1658226Z inflating: build/bin/cuda_apply_test 2024-01-31T22:13:53.1707438Z inflating: build/bin/cuda_allocator_test 2024-01-31T22:13:53.1764445Z inflating: build/bin/cuda_atomic_ops_test 2024-01-31T22:13:53.1817254Z inflating: build/bin/cuda_caching_host_allocator_test 2024-01-31T22:13:53.1884764Z inflating: build/bin/cuda_complex_math_test 2024-01-31T22:13:53.1942594Z inflating: build/bin/cuda_complex_test 2024-01-31T22:13:53.1990681Z inflating: build/bin/cuda_device_test 2024-01-31T22:13:53.2046218Z inflating: build/bin/cuda_cub_test 2024-01-31T22:13:53.2095458Z inflating: build/bin/cuda_dlconvertor_test 2024-01-31T22:13:53.2158063Z inflating: build/bin/cuda_distributions_test 2024-01-31T22:13:53.2213215Z inflating: build/bin/cuda_generator_test 2024-01-31T22:13:53.2261447Z inflating: build/bin/cuda_half_test 2024-01-31T22:13:53.2311047Z inflating: build/bin/cuda_integer_divider_test 2024-01-31T22:13:53.2359466Z inflating: build/bin/cuda_optional_test 2024-01-31T22:13:53.2408683Z inflating: build/bin/cuda_packedtensoraccessor_test 2024-01-31T22:13:53.2459514Z inflating: build/bin/cuda_reportMemoryUsage_test 2024-01-31T22:13:53.2507781Z inflating: build/bin/cuda_allocatorTraceTracker_test 2024-01-31T22:13:53.2566419Z inflating: build/bin/cuda_stream_test 2024-01-31T22:13:53.2614351Z inflating: build/bin/cuda_cudnn_test 2024-01-31T22:13:53.2664813Z inflating: build/bin/cuda_vectorized_test 2024-01-31T22:13:53.2679200Z inflating: build/bin/tutorial_tensorexpr 2024-01-31T22:13:53.2742920Z inflating: build/bin/ProcessGroupGlooTest 2024-01-31T22:13:53.2798284Z inflating: build/bin/ProcessGroupGlooAsyncTest 2024-01-31T22:13:53.2859157Z inflating: build/bin/ProcessGroupNCCLTest 2024-01-31T22:13:53.2917497Z inflating: build/bin/ProcessGroupNCCLErrorsTest 2024-01-31T22:13:53.3722767Z inflating: build/bin/test_tensorexpr 2024-01-31T22:13:53.3726755Z inflating: build/bin/torch_shm_manager 2024-01-31T22:13:53.4271791Z inflating: build/bin/test_jit 2024-01-31T22:13:53.4272635Z creating: .additional_ci_files/ 2024-01-31T22:13:53.4302041Z inflating: .additional_ci_files/test-times.json 2024-01-31T22:13:53.4422672Z inflating: .additional_ci_files/test-class-times.json 2024-01-31T22:13:53.5626599Z inflating: .additional_ci_files/test-file-ratings.json 2024-01-31T22:13:53.8086991Z inflating: .additional_ci_files/test-class-ratings.json 2024-01-31T22:13:53.8249407Z inflating: .additional_ci_files/td_heuristic_historical_edited_files.json 2024-01-31T22:13:53.8321287Z inflating: .additional_ci_files/td_heuristic_profiling.json 2024-01-31T22:13:53.8322658Z inflating: .additional_ci_files/previous_failures.json 2024-01-31T22:13:53.8346429Z ##[group]Run df -H 2024-01-31T22:13:53.8346740Z df -H 2024-01-31T22:13:53.8356020Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:53.8356508Z env: 2024-01-31T22:13:53.8356774Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:53.8357215Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:53.8357682Z ##[endgroup] 2024-01-31T22:13:53.8383758Z Filesystem Size Used Avail Use% Mounted on 2024-01-31T22:13:53.8384467Z devtmpfs 34G 0 34G 0% /dev 2024-01-31T22:13:53.8384945Z tmpfs 34G 0 34G 0% /dev/shm 2024-01-31T22:13:53.8399001Z tmpfs 34G 459k 34G 1% /run 2024-01-31T22:13:53.8399583Z tmpfs 34G 0 34G 0% /sys/fs/cgroup 2024-01-31T22:13:53.8400071Z /dev/nvme0n1p1 162G 36G 126G 22% / 2024-01-31T22:13:53.8427253Z ##[group]Run .github/scripts/parse_ref.py 2024-01-31T22:13:53.8427729Z .github/scripts/parse_ref.py 2024-01-31T22:13:53.8435594Z shell: /usr/bin/bash -e {0} 2024-01-31T22:13:53.8435940Z env: 2024-01-31T22:13:53.8436218Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:53.8436678Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:53.8437154Z ##[endgroup] 2024-01-31T22:13:53.8657126Z Prepare all required actions 2024-01-31T22:13:53.8694054Z ##[group]Run ./.github/actions/get-workflow-job-id 2024-01-31T22:13:53.8694513Z with: 2024-01-31T22:13:53.8695192Z github-token: *** 2024-01-31T22:13:53.8695500Z env: 2024-01-31T22:13:53.8695784Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:53.8696253Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:53.8696742Z ##[endgroup] 2024-01-31T22:13:53.8711902Z ##[group]Run set -eux 2024-01-31T22:13:53.8712244Z set -eux 2024-01-31T22:13:53.8712832Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2024-01-31T22:13:53.8720264Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:53.8720768Z env: 2024-01-31T22:13:53.8721045Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:53.8721491Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:53.8722130Z GITHUB_TOKEN: *** 2024-01-31T22:13:53.8722440Z ##[endgroup] 2024-01-31T22:13:53.8741615Z + python3 .github/scripts/get_workflow_job_id.py 7731552918 i-0f7dd096d568eb180 2024-01-31T22:13:54.7494312Z setting job-id=21083466405 2024-01-31T22:13:54.7495267Z setting job-name=linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:13:54.7685814Z Prepare all required actions 2024-01-31T22:13:54.7686269Z Getting action download info 2024-01-31T22:13:54.8984007Z ##[group]Run ./.github/actions/filter-test-configs 2024-01-31T22:13:54.8984469Z with: 2024-01-31T22:13:54.8984918Z github-token: *** 2024-01-31T22:13:54.8986662Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}]} 2024-01-31T22:13:54.8988881Z job-name: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:13:54.8989670Z env: 2024-01-31T22:13:54.8989955Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:54.8990415Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:54.8990937Z ##[endgroup] 2024-01-31T22:13:54.9032696Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-01-31T22:13:54.9033252Z with: 2024-01-31T22:13:54.9033510Z shell: bash 2024-01-31T22:13:54.9033810Z timeout_minutes: 10 2024-01-31T22:13:54.9034136Z max_attempts: 5 2024-01-31T22:13:54.9034446Z retry_wait_seconds: 30 2024-01-31T22:13:54.9035219Z command: set -eux # PyYAML 6.0 doesn't work with MacOS x86 anymore python3 -m pip install requests==2.26.0 pyyaml==6.0.1 2024-01-31T22:13:54.9036050Z polling_interval_seconds: 1 2024-01-31T22:13:54.9036428Z warning_on_retry: true 2024-01-31T22:13:54.9036775Z continue_on_error: false 2024-01-31T22:13:54.9037107Z env: 2024-01-31T22:13:54.9037375Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:54.9037825Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:54.9038466Z GITHUB_TOKEN: *** 2024-01-31T22:13:54.9038784Z ##[endgroup] 2024-01-31T22:13:54.9529906Z + python3 -m pip install requests==2.26.0 pyyaml==6.0.1 2024-01-31T22:13:55.1549592Z Defaulting to user installation because normal site-packages is not writeable 2024-01-31T22:13:55.1696625Z Requirement already satisfied: requests==2.26.0 in /home/ec2-user/.local/lib/python3.7/site-packages (2.26.0) 2024-01-31T22:13:55.1835401Z Requirement already satisfied: pyyaml==6.0.1 in /home/ec2-user/.local/lib/python3.7/site-packages (6.0.1) 2024-01-31T22:13:55.1843489Z Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.26.0) (1.26.18) 2024-01-31T22:13:55.2031549Z Requirement already satisfied: certifi>=2017.4.17 in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.26.0) (2023.11.17) 2024-01-31T22:13:55.2038692Z Requirement already satisfied: charset-normalizer~=2.0.0; python_version >= "3" in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.26.0) (2.0.12) 2024-01-31T22:13:55.2059407Z Requirement already satisfied: idna<4,>=2.5; python_version >= "3" in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.26.0) (3.6) 2024-01-31T22:13:55.9532916Z Command completed after 1 attempt(s). 2024-01-31T22:13:55.9576483Z ##[group]Run set -x 2024-01-31T22:13:55.9576816Z set -x 2024-01-31T22:13:55.9577111Z  2024-01-31T22:13:55.9577659Z # Use relative path here as this could be checked out anywhere, not necessarily 2024-01-31T22:13:55.9578329Z # in runner workspace 2024-01-31T22:13:55.9578857Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2024-01-31T22:13:55.9586849Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:55.9587345Z env: 2024-01-31T22:13:55.9587621Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:55.9588076Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:55.9588556Z ##[endgroup] 2024-01-31T22:13:55.9610491Z + python3 /home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2024-01-31T22:13:55.9802237Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2024-01-31T22:13:55.9802786Z echo "Workflow: ${GITHUB_WORKFLOW}" 2024-01-31T22:13:55.9803248Z echo "Job name: ${JOB_NAME}" 2024-01-31T22:13:55.9803646Z  2024-01-31T22:13:55.9804192Z # Use relative path here as this could be checked out anywhere, not necessarily 2024-01-31T22:13:55.9805136Z # in runner workspace 2024-01-31T22:13:55.9805745Z python3 "${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2024-01-31T22:13:55.9806389Z  --workflow "${GITHUB_WORKFLOW}" \ 2024-01-31T22:13:55.9806847Z  --job-name "${JOB_NAME}" \ 2024-01-31T22:13:55.9808730Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" \ 2024-01-31T22:13:55.9810702Z  --pr-number "${PR_NUMBER}" \ 2024-01-31T22:13:55.9811131Z  --tag "${TAG}" \ 2024-01-31T22:13:55.9811530Z  --event-name "${EVENT_NAME}" \ 2024-01-31T22:13:55.9811974Z  --schedule "${SCHEDULE}" \ 2024-01-31T22:13:55.9812407Z  --branch "${HEAD_BRANCH}" 2024-01-31T22:13:55.9819919Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:55.9820420Z env: 2024-01-31T22:13:55.9820698Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:55.9821148Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:55.9821820Z GITHUB_TOKEN: *** 2024-01-31T22:13:55.9822546Z JOB_NAME: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:13:55.9823407Z PR_NUMBER: 2024-01-31T22:13:55.9823710Z TAG: ciflow/slow/118586 2024-01-31T22:13:55.9824057Z EVENT_NAME: push 2024-01-31T22:13:55.9824363Z SCHEDULE: 2024-01-31T22:13:55.9824643Z HEAD_BRANCH: 2024-01-31T22:13:55.9824940Z ##[endgroup] 2024-01-31T22:13:55.9856170Z Workflow: slow 2024-01-31T22:13:55.9857102Z Job name: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:13:56.2948651Z fatal: unknown commit main 2024-01-31T22:13:56.2968136Z /home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/filter_test_configs.py:472: UserWarning: failed to get commit messages: Command '['git', 'cherry', '-v', 'main']' returned non-zero exit status 128. 2024-01-31T22:13:56.2969690Z warnings.warn(f"failed to get commit messages: {e}") 2024-01-31T22:13:56.3067420Z ##[group]Run echo "Filtered matrix:" 2024-01-31T22:13:56.3067862Z echo "Filtered matrix:" 2024-01-31T22:13:56.3069788Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 4, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" 2024-01-31T22:13:56.3071634Z  2024-01-31T22:13:56.3071982Z echo 2024-01-31T22:13:56.3072368Z echo "Is the current job unstable? False" 2024-01-31T22:13:56.3072820Z  2024-01-31T22:13:56.3073210Z echo 2024-01-31T22:13:56.3073598Z echo "Is keep-going label set? True" 2024-01-31T22:13:56.3074025Z  2024-01-31T22:13:56.3074294Z echo 2024-01-31T22:13:56.3074613Z echo "Renabled issues? " 2024-01-31T22:13:56.3082283Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:56.3082778Z env: 2024-01-31T22:13:56.3083054Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:56.3083613Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:56.3084092Z ##[endgroup] 2024-01-31T22:13:56.3104382Z Filtered matrix: 2024-01-31T22:13:56.3106859Z {include: [{config: default, shard: 1, num_shards: 4, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 2, num_shards: 4, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 3, num_shards: 4, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 4, num_shards: 4, runner: linux.g5.4xlarge.nvidia.gpu}]} 2024-01-31T22:13:56.3108492Z 2024-01-31T22:13:56.3108643Z Is the current job unstable? False 2024-01-31T22:13:56.3108931Z 2024-01-31T22:13:56.3109218Z Is keep-going label set? True 2024-01-31T22:13:56.3109474Z 2024-01-31T22:13:56.3109603Z Renabled issues? 2024-01-31T22:13:56.3151485Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2024-01-31T22:13:56.3152196Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2024-01-31T22:13:56.3160017Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-01-31T22:13:56.3160513Z env: 2024-01-31T22:13:56.3160777Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:56.3161227Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:56.3161713Z JOB_TIMEOUT: 300 2024-01-31T22:13:56.3162007Z ##[endgroup] 2024-01-31T22:13:56.3288632Z ##[group]Run set -x 2024-01-31T22:13:56.3289031Z set -x 2024-01-31T22:13:56.3289325Z  2024-01-31T22:13:56.3289675Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2024-01-31T22:13:56.3290213Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2024-01-31T22:13:56.3290868Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2024-01-31T22:13:56.3291378Z  TEST_COMMAND=.ci/onnx/test.sh 2024-01-31T22:13:56.3291781Z else 2024-01-31T22:13:56.3292122Z  TEST_COMMAND=.ci/pytorch/test.sh 2024-01-31T22:13:56.3292550Z fi 2024-01-31T22:13:56.3292828Z  2024-01-31T22:13:56.3293290Z # detached container should get cleaned up by teardown_ec2_linux 2024-01-31T22:13:56.3294048Z # TODO: Stop building test binaries as part of the build phase 2024-01-31T22:13:56.3294719Z # Used for GPU_FLAG since that doesn't play nice 2024-01-31T22:13:56.3295297Z # shellcheck disable=SC2086,SC2090 2024-01-31T22:13:56.3295758Z container_name=$(docker run \ 2024-01-31T22:13:56.3296177Z  ${GPU_FLAG:-} \ 2024-01-31T22:13:56.3296551Z  -e BUILD_ENVIRONMENT \ 2024-01-31T22:13:56.3296938Z  -e PR_NUMBER \ 2024-01-31T22:13:56.3297293Z  -e GITHUB_ACTIONS \ 2024-01-31T22:13:56.3297681Z  -e GITHUB_REPOSITORY \ 2024-01-31T22:13:56.3298077Z  -e GITHUB_WORKFLOW \ 2024-01-31T22:13:56.3298457Z  -e GITHUB_JOB \ 2024-01-31T22:13:56.3298813Z  -e GITHUB_RUN_ID \ 2024-01-31T22:13:56.3299194Z  -e GITHUB_RUN_NUMBER \ 2024-01-31T22:13:56.3299723Z  -e GITHUB_RUN_ATTEMPT \ 2024-01-31T22:13:56.3300117Z  -e JOB_ID \ 2024-01-31T22:13:56.3300454Z  -e JOB_NAME \ 2024-01-31T22:13:56.3300792Z  -e BASE_SHA \ 2024-01-31T22:13:56.3301130Z  -e BRANCH \ 2024-01-31T22:13:56.3301460Z  -e SHA1 \ 2024-01-31T22:13:56.3301796Z  -e AWS_DEFAULT_REGION \ 2024-01-31T22:13:56.3302199Z  -e IN_WHEEL_TEST \ 2024-01-31T22:13:56.3302573Z  -e SHARD_NUMBER \ 2024-01-31T22:13:56.3302931Z  -e TEST_CONFIG \ 2024-01-31T22:13:56.3303295Z  -e NUM_TEST_SHARDS \ 2024-01-31T22:13:56.3303687Z  -e REENABLED_ISSUES \ 2024-01-31T22:13:56.3304098Z  -e CONTINUE_THROUGH_ERROR \ 2024-01-31T22:13:56.3304523Z  -e VERBOSE_TEST_LOGS \ 2024-01-31T22:13:56.3304919Z  -e NO_TEST_TIMEOUT \ 2024-01-31T22:13:56.3305293Z  -e PR_LABELS \ 2024-01-31T22:13:56.3305694Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2024-01-31T22:13:56.3306158Z  -e SCCACHE_BUCKET \ 2024-01-31T22:13:56.3306547Z  -e SCCACHE_S3_KEY_PREFIX \ 2024-01-31T22:13:56.3306948Z  -e XLA_CUDA \ 2024-01-31T22:13:56.3307347Z  -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ 2024-01-31T22:13:56.3307849Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2024-01-31T22:13:56.3308375Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2024-01-31T22:13:56.3308891Z  -e SKIP_SCCACHE_INITIALIZATION=1 \ 2024-01-31T22:13:56.3309352Z  -e HUGGING_FACE_HUB_TOKEN \ 2024-01-31T22:13:56.3309764Z  -e DASHBOARD_TAG \ 2024-01-31T22:13:56.3310237Z  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2024-01-31T22:13:56.3310769Z  --ulimit stack=10485760:83886080 \ 2024-01-31T22:13:56.3311248Z  --security-opt seccomp=unconfined \ 2024-01-31T22:13:56.3311712Z  --cap-add=SYS_PTRACE \ 2024-01-31T22:13:56.3312107Z  --ipc=host \ 2024-01-31T22:13:56.3312467Z  --shm-size="${SHM_SIZE}" \ 2024-01-31T22:13:56.3312861Z  --tty \ 2024-01-31T22:13:56.3313202Z  --detach \ 2024-01-31T22:13:56.3313579Z  --name="${container_name}" \ 2024-01-31T22:13:56.3313997Z  --user jenkins \ 2024-01-31T22:13:56.3314553Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 2024-01-31T22:13:56.3315114Z  -w /var/lib/jenkins/workspace \ 2024-01-31T22:13:56.3315554Z  "${DOCKER_IMAGE}" 2024-01-31T22:13:56.3315897Z ) 2024-01-31T22:13:56.3316294Z # Propagate download.pytorch.org IP to container 2024-01-31T22:13:56.3317218Z grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts" 2024-01-31T22:13:56.3318185Z echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}" 2024-01-31T22:13:56.3319084Z docker exec -t "${container_name}" sh -c "pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}" 2024-01-31T22:13:56.3326851Z shell: /usr/bin/bash -e {0} 2024-01-31T22:13:56.3327207Z env: 2024-01-31T22:13:56.3327484Z GIT_DEFAULT_BRANCH: main 2024-01-31T22:13:56.3327927Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:13:56.3328573Z BUILD_ENVIRONMENT: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-01-31T22:13:56.3329118Z PR_NUMBER: 2024-01-31T22:13:56.3329431Z GITHUB_REPOSITORY: pytorch/pytorch 2024-01-31T22:13:56.3329838Z GITHUB_WORKFLOW: slow 2024-01-31T22:13:56.3330170Z GITHUB_JOB: test 2024-01-31T22:13:56.3330568Z GITHUB_RUN_ID: 7731552918 2024-01-31T22:13:56.3330929Z GITHUB_RUN_NUMBER: 2791 2024-01-31T22:13:56.3331278Z GITHUB_RUN_ATTEMPT: 2 2024-01-31T22:13:56.3331595Z JOB_ID: 21083466405 2024-01-31T22:13:56.3332314Z JOB_NAME: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:13:56.3333100Z BRANCH: 2024-01-31T22:13:56.3333436Z SHA1: 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:13:56.3334073Z BASE_SHA: 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:13:56.3334537Z TEST_CONFIG: default 2024-01-31T22:13:56.3334859Z SHARD_NUMBER: 1 2024-01-31T22:13:56.3335162Z NUM_TEST_SHARDS: 4 2024-01-31T22:13:56.3335482Z REENABLED_ISSUES: 2024-01-31T22:13:56.3335809Z CONTINUE_THROUGH_ERROR: True 2024-01-31T22:13:56.3336195Z VERBOSE_TEST_LOGS: False 2024-01-31T22:13:56.3336547Z NO_TEST_TIMEOUT: False 2024-01-31T22:13:56.3336959Z SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 2024-01-31T22:13:56.3337436Z SCCACHE_S3_KEY_PREFIX: slow 2024-01-31T22:13:56.3337795Z SHM_SIZE: 2g 2024-01-31T22:13:56.3338733Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:13:56.3339739Z XLA_CUDA: 2024-01-31T22:13:56.3340214Z XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla 2024-01-31T22:13:56.3340831Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2024-01-31T22:13:56.3341262Z PYTORCH_TEST_RERUN_DISABLED_TESTS: 0 2024-01-31T22:13:56.3341683Z DASHBOARD_TAG: 2024-01-31T22:13:56.3342001Z HUGGING_FACE_HUB_TOKEN: 2024-01-31T22:13:56.3342333Z ##[endgroup] 2024-01-31T22:13:56.3361111Z + [[ default == \m\u\l\t\i\g\p\u ]] 2024-01-31T22:13:56.3361839Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *onnx* ]] 2024-01-31T22:13:56.3362409Z + TEST_COMMAND=.ci/pytorch/test.sh 2024-01-31T22:13:56.3367580Z +++ nproc --ignore=2 2024-01-31T22:13:56.3380259Z ++ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e NO_TEST_TIMEOUT -e PR_LABELS -e MAX_JOBS=14 -e SCCACHE_BUCKET -e SCCACHE_S3_KEY_PREFIX -e XLA_CUDA -e XLA_CLANG_CACHE_S3_BUCKET_NAME -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e SKIP_SCCACHE_INITIALIZATION=1 -e HUGGING_FACE_HUB_TOKEN -e DASHBOARD_TAG --env-file=/tmp/github_env_7731552918 --ulimit stack=10485760:83886080 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --shm-size=2g --tty --detach --name= --user jenkins -v /home/ec2-user/actions-runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-01-31T22:14:12.5671501Z + container_name=9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-01-31T22:14:12.5672556Z + grep download.pytorch.org /etc/hosts 2024-01-31T22:14:12.5673749Z + docker exec -i 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd sudo bash -c '/bin/cat >> /etc/hosts' 2024-01-31T22:14:12.6238723Z sudo: setrlimit(RLIMIT_STACK): Operation not permitted 2024-01-31T22:14:12.6283507Z + echo DOCKER_CONTAINER_ID=9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-01-31T22:14:12.6285855Z ++ echo dist/torch-2.3.0a0+git86d2ab3-cp310-cp310-linux_x86_64.whl 2024-01-31T22:14:12.6287880Z + docker exec -t 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd sh -c 'pip install dist/torch-2.3.0a0+git86d2ab3-cp310-cp310-linux_x86_64.whl[opt-einsum] && .ci/pytorch/test.sh' 2024-01-31T22:14:12.9972570Z Processing ./dist/torch-2.3.0a0+git86d2ab3-cp310-cp310-linux_x86_64.whl 2024-01-31T22:14:13.2969586Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (3.9.0) 2024-01-31T22:14:13.2974038Z Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (4.9.0) 2024-01-31T22:14:13.2976265Z Requirement already satisfied: sympy in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (1.12) 2024-01-31T22:14:13.2979467Z Requirement already satisfied: networkx in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (2.8.8) 2024-01-31T22:14:13.2982430Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (3.1.2) 2024-01-31T22:14:13.2985258Z Requirement already satisfied: fsspec in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (2023.4.0) 2024-01-31T22:14:13.2995948Z Requirement already satisfied: opt-einsum>=3.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.3.0a0+git86d2ab3) (3.3.0) 2024-01-31T22:14:13.3054072Z Requirement already satisfied: numpy>=1.7 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from opt-einsum>=3.3->torch==2.3.0a0+git86d2ab3) (1.21.2) 2024-01-31T22:14:13.3450349Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.3.0a0+git86d2ab3) (2.1.4) 2024-01-31T22:14:13.3603073Z Requirement already satisfied: mpmath>=0.19 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy->torch==2.3.0a0+git86d2ab3) (1.3.0) 2024-01-31T22:14:14.0794863Z Installing collected packages: torch 2024-01-31T22:14:22.0511642Z Successfully installed torch-2.3.0a0+git86d2ab3 2024-01-31T22:14:22.1124646Z + echo 'Environment variables:' 2024-01-31T22:14:22.1125201Z Environment variables: 2024-01-31T22:14:22.1125514Z + env 2024-01-31T22:14:22.1130004Z INSTALLED_DB=yes 2024-01-31T22:14:22.1130577Z NV_LIBCUBLAS_VERSION=12.1.3.1-1 2024-01-31T22:14:22.1131036Z NVIDIA_VISIBLE_DEVICES=all 2024-01-31T22:14:22.1131587Z NV_NVML_DEV_VERSION=12.1.105-1 2024-01-31T22:14:22.1132377Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-01-31T22:14:22.1134830Z CONTINUE_THROUGH_ERROR=True 2024-01-31T22:14:22.1135380Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1 2024-01-31T22:14:22.1135938Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1 2024-01-31T22:14:22.1136751Z BUILD_ENVIRONMENT=linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-01-31T22:14:22.1137411Z HOSTNAME=9c1b86787841 2024-01-31T22:14:22.1138534Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1139965Z GITHUB_ACTION=__self 2024-01-31T22:14:22.1140458Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2024-01-31T22:14:22.1144847Z NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 2024-01-31T22:14:22.1150051Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-1=12.1.3.1-1 2024-01-31T22:14:22.1150732Z NV_NVTX_VERSION=12.1.105-1 2024-01-31T22:14:22.1151204Z GITHUB_RUN_NUMBER=2791 2024-01-31T22:14:22.1151543Z TEST_CONFIG=default 2024-01-31T22:14:22.1151887Z GITHUB_REPOSITORY_OWNER_ID=21003710 2024-01-31T22:14:22.1152364Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2024-01-31T22:14:22.1152868Z NV_CUDA_CUDART_DEV_VERSION=12.1.105-1 2024-01-31T22:14:22.1153470Z NV_LIBCUSPARSE_VERSION=12.1.0.106-1 2024-01-31T22:14:22.1153900Z NV_LIBNPP_VERSION=12.1.0.40-1 2024-01-31T22:14:22.1154277Z GITHUB_TRIGGERING_ACTOR=clee2000 2024-01-31T22:14:22.1154732Z CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache 2024-01-31T22:14:22.1155228Z GITHUB_REF_TYPE=tag 2024-01-31T22:14:22.1155694Z TORCH_CUDA_ARCH_LIST=Maxwell 2024-01-31T22:14:22.1156074Z NCCL_VERSION=2.17.1-1 2024-01-31T22:14:22.1156466Z BASE_SHA=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1156910Z XLA_CUDA= 2024-01-31T22:14:22.1157190Z HUGGING_FACE_HUB_TOKEN= 2024-01-31T22:14:22.1158808Z *** 2024-01-31T22:14:22.1159106Z CARGO_NET_GIT_FETCH_WITH_CLI=true 2024-01-31T22:14:22.1159504Z GITHUB_REPOSITORY_ID=65600975 2024-01-31T22:14:22.1159870Z GITHUB_ACTIONS=true 2024-01-31T22:14:22.1160202Z NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:14:22.1160751Z NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=12.1.105-1 2024-01-31T22:14:22.1161279Z NV_LIBNPP_PACKAGE=libnpp-12-1=12.1.0.40-1 2024-01-31T22:14:22.1161752Z SHA1=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1162255Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2024-01-31T22:14:22.1162741Z GITHUB_SHA=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1163496Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/slow.yml@refs/tags/ciflow/slow/118586 2024-01-31T22:14:22.1164160Z UCC_HOME=/usr 2024-01-31T22:14:22.1164491Z NV_LIBCUBLAS_DEV_VERSION=12.1.3.1-1 2024-01-31T22:14:22.1165124Z VERBOSE_TEST_LOGS=False 2024-01-31T22:14:22.1165476Z NVIDIA_PRODUCT_NAME=CUDA 2024-01-31T22:14:22.1166021Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-1 2024-01-31T22:14:22.1166508Z GITHUB_REF=refs/tags/ciflow/slow/118586 2024-01-31T22:14:22.1166950Z NV_CUDA_CUDART_VERSION=12.1.105-1 2024-01-31T22:14:22.1167325Z SHARD_NUMBER=1 2024-01-31T22:14:22.1167628Z GITHUB_REF_PROTECTED=false 2024-01-31T22:14:22.1167967Z HOME=/var/lib/jenkins 2024-01-31T22:14:22.1168331Z GITHUB_API_URL=https://api.github.com 2024-01-31T22:14:22.1168769Z PYTORCH_TEST_RERUN_DISABLED_TESTS=0 2024-01-31T22:14:22.1169243Z UCX_COMMIT=00bcc6bb18fc282eb160623b4c0d300147f579af 2024-01-31T22:14:22.1169709Z SCCACHE_S3_KEY_PREFIX=slow 2024-01-31T22:14:22.1170050Z CUDA_VERSION=12.1.1 2024-01-31T22:14:22.1170481Z NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1.3.1-1 2024-01-31T22:14:22.1170946Z NUM_TEST_SHARDS=4 2024-01-31T22:14:22.1171242Z UCX_HOME=/usr 2024-01-31T22:14:22.1171735Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-1=12.1.1-1 2024-01-31T22:14:22.1172790Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1174194Z JOB_NAME=linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:14:22.1175474Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1176563Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2024-01-31T22:14:22.1177218Z GITHUB_EVENT_NAME=push 2024-01-31T22:14:22.1177541Z DASHBOARD_TAG= 2024-01-31T22:14:22.1177829Z GITHUB_RUN_ID=7731552918 2024-01-31T22:14:22.1178278Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12.1.0.40-1 2024-01-31T22:14:22.1178809Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1 2024-01-31T22:14:22.1179791Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1180683Z GITHUB_ACTOR=pytorch-bot[bot] 2024-01-31T22:14:22.1181088Z NV_LIBNPP_DEV_VERSION=12.1.0.40-1 2024-01-31T22:14:22.1181452Z PR_NUMBER= 2024-01-31T22:14:22.1181730Z GITHUB_RUN_ATTEMPT=2 2024-01-31T22:14:22.1182069Z ANACONDA_PYTHON_VERSION=3.10 2024-01-31T22:14:22.1182502Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2024-01-31T22:14:22.1182955Z TERM=xterm 2024-01-31T22:14:22.1183278Z NV_LIBCUSPARSE_DEV_VERSION=12.1.0.106-1 2024-01-31T22:14:22.1183687Z INSTALLED_VISION=yes 2024-01-31T22:14:22.1183996Z BRANCH= 2024-01-31T22:14:22.1184277Z OPENSSL_ROOT_DIR=/opt/openssl 2024-01-31T22:14:22.1184666Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2024-01-31T22:14:22.1185085Z CUDA_PATH=/usr/local/cuda 2024-01-31T22:14:22.1185824Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2024-01-31T22:14:22.1186641Z GITHUB_SERVER_URL=https://github.com 2024-01-31T22:14:22.1187127Z UCC_COMMIT=7cb07a76ccedad7e56ceb136b865eb9319c258ea 2024-01-31T22:14:22.1187588Z REENABLED_ISSUES= 2024-01-31T22:14:22.1187872Z SHLVL=1 2024-01-31T22:14:22.1188128Z MAX_JOBS=14 2024-01-31T22:14:22.1188436Z NV_CUDA_LIB_VERSION=12.1.1-1 2024-01-31T22:14:22.1188779Z NVARCH=x86_64 2024-01-31T22:14:22.1189081Z GITHUB_ACTOR_ID=54816060 2024-01-31T22:14:22.1189541Z GITHUB_WORKFLOW_SHA=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1190061Z GITHUB_REF_NAME=ciflow/slow/118586 2024-01-31T22:14:22.1190518Z NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1 2024-01-31T22:14:22.1191174Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2024-01-31T22:14:22.1191733Z GITHUB_JOB=test 2024-01-31T22:14:22.1192125Z NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1 2024-01-31T22:14:22.1192682Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2024-01-31T22:14:22.1193190Z NO_TEST_TIMEOUT=False 2024-01-31T22:14:22.1193572Z NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.1-1 2024-01-31T22:14:22.1194009Z GITHUB_REPOSITORY=pytorch/pytorch 2024-01-31T22:14:22.1194416Z NV_NVPROF_VERSION=12.1.105-1 2024-01-31T22:14:22.1194776Z GITHUB_RETENTION_DAYS=90 2024-01-31T22:14:22.1195120Z OPENSSL_DIR=/opt/openssl 2024-01-31T22:14:22.1195458Z GITHUB_ACTION_REPOSITORY= 2024-01-31T22:14:22.1196497Z PATH=/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-01-31T22:14:22.1197561Z GITHUB_BASE_REF= 2024-01-31T22:14:22.1197873Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2024-01-31T22:14:22.1198235Z CI=true 2024-01-31T22:14:22.1198540Z NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1 2024-01-31T22:14:22.1198941Z GITHUB_REPOSITORY_OWNER=pytorch 2024-01-31T22:14:22.1199309Z JOB_ID=21083466405 2024-01-31T22:14:22.1199613Z INSTALLED_PROTOBUF=yes 2024-01-31T22:14:22.1199931Z GITHUB_HEAD_REF= 2024-01-31T22:14:22.1200230Z GITHUB_ACTION_REF= 2024-01-31T22:14:22.1200683Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2024-01-31T22:14:22.1201168Z GITHUB_WORKFLOW=slow 2024-01-31T22:14:22.1201501Z DEBIAN_FRONTEND=noninteractive 2024-01-31T22:14:22.1202383Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1203202Z SKIP_SCCACHE_INITIALIZATION=1 2024-01-31T22:14:22.1203646Z _=/usr/bin/env 2024-01-31T22:14:22.1204130Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2024-01-31T22:14:22.1270255Z + TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2024-01-31T22:14:22.1271384Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2024-01-31T22:14:22.1272251Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2024-01-31T22:14:22.1273316Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2024-01-31T22:14:22.1273922Z + BUILD_DIR=build 2024-01-31T22:14:22.1274250Z + BUILD_RENAMED_DIR=build_renamed 2024-01-31T22:14:22.1274654Z + BUILD_BIN_DIR=build/bin 2024-01-31T22:14:22.1274987Z + SHARD_NUMBER=1 2024-01-31T22:14:22.1275304Z + NUM_TEST_SHARDS=4 2024-01-31T22:14:22.1275665Z + export VALGRIND=ON 2024-01-31T22:14:22.1276010Z + VALGRIND=ON 2024-01-31T22:14:22.1276535Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *clang9* ]] 2024-01-31T22:14:22.1277081Z + [[ 0 == \1 ]] 2024-01-31T22:14:22.1277366Z + [[ True == \1 ]] 2024-01-31T22:14:22.1277871Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck != *bazel* ]] 2024-01-31T22:14:22.1278438Z ++ realpath build/custom_test_artifacts 2024-01-31T22:14:22.1281869Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/workspace/build/custom_test_artifacts 2024-01-31T22:14:22.1282560Z + [[ -n '' ]] 2024-01-31T22:14:22.1283566Z ++ dirname .ci/pytorch/test.sh 2024-01-31T22:14:22.1288886Z + source .ci/pytorch/common.sh 2024-01-31T22:14:22.1290653Z +++ dirname .ci/pytorch/common.sh 2024-01-31T22:14:22.1295381Z ++ source .ci/pytorch/common_utils.sh 2024-01-31T22:14:22.1297010Z +++ declare -f -t trap_add 2024-01-31T22:14:22.1302138Z ++ set -ex 2024-01-31T22:14:22.1302754Z ++ [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *rocm* ]] 2024-01-31T22:14:22.1303297Z ++ BUILD_TEST_LIBTORCH=0 2024-01-31T22:14:22.1303680Z + echo 'Environment variables' 2024-01-31T22:14:22.1304056Z Environment variables 2024-01-31T22:14:22.1304363Z + env 2024-01-31T22:14:22.1307855Z INSTALLED_DB=yes 2024-01-31T22:14:22.1308400Z NV_LIBCUBLAS_VERSION=12.1.3.1-1 2024-01-31T22:14:22.1309006Z NVIDIA_VISIBLE_DEVICES=all 2024-01-31T22:14:22.1309474Z NV_NVML_DEV_VERSION=12.1.105-1 2024-01-31T22:14:22.1310112Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-01-31T22:14:22.1310703Z CONTINUE_THROUGH_ERROR=True 2024-01-31T22:14:22.1311245Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1 2024-01-31T22:14:22.1311787Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1 2024-01-31T22:14:22.1312451Z BUILD_ENVIRONMENT=linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-01-31T22:14:22.1312987Z HOSTNAME=9c1b86787841 2024-01-31T22:14:22.1313916Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1314803Z GITHUB_ACTION=__self 2024-01-31T22:14:22.1315212Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2024-01-31T22:14:22.1319288Z NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 2024-01-31T22:14:22.1323199Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-1=12.1.3.1-1 2024-01-31T22:14:22.1323696Z NV_NVTX_VERSION=12.1.105-1 2024-01-31T22:14:22.1324048Z GITHUB_RUN_NUMBER=2791 2024-01-31T22:14:22.1324381Z TEST_CONFIG=default 2024-01-31T22:14:22.1324710Z GITHUB_REPOSITORY_OWNER_ID=21003710 2024-01-31T22:14:22.1325557Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2024-01-31T22:14:22.1326116Z NV_CUDA_CUDART_DEV_VERSION=12.1.105-1 2024-01-31T22:14:22.1326548Z NV_LIBCUSPARSE_VERSION=12.1.0.106-1 2024-01-31T22:14:22.1326966Z NV_LIBNPP_VERSION=12.1.0.40-1 2024-01-31T22:14:22.1327349Z GITHUB_TRIGGERING_ACTOR=clee2000 2024-01-31T22:14:22.1327789Z CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache 2024-01-31T22:14:22.1328258Z GITHUB_REF_TYPE=tag 2024-01-31T22:14:22.1328582Z TORCH_CUDA_ARCH_LIST=Maxwell 2024-01-31T22:14:22.1328946Z NCCL_VERSION=2.17.1-1 2024-01-31T22:14:22.1329343Z BASE_SHA=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1329790Z XLA_CUDA= 2024-01-31T22:14:22.1330062Z HUGGING_FACE_HUB_TOKEN= 2024-01-31T22:14:22.1330475Z *** 2024-01-31T22:14:22.1330789Z CARGO_NET_GIT_FETCH_WITH_CLI=true 2024-01-31T22:14:22.1331176Z GITHUB_REPOSITORY_ID=65600975 2024-01-31T22:14:22.1331541Z GITHUB_ACTIONS=true 2024-01-31T22:14:22.1331876Z NVIDIA_DRIVER_CAPABILITIES=all 2024-01-31T22:14:22.1332358Z NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=12.1.105-1 2024-01-31T22:14:22.1332897Z NV_LIBNPP_PACKAGE=libnpp-12-1=12.1.0.40-1 2024-01-31T22:14:22.1333372Z SHA1=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1333873Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2024-01-31T22:14:22.1334363Z GITHUB_SHA=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1335112Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/slow.yml@refs/tags/ciflow/slow/118586 2024-01-31T22:14:22.1335769Z UCC_HOME=/usr 2024-01-31T22:14:22.1336107Z NV_LIBCUBLAS_DEV_VERSION=12.1.3.1-1 2024-01-31T22:14:22.1336505Z VERBOSE_TEST_LOGS=False 2024-01-31T22:14:22.1336920Z NVIDIA_PRODUCT_NAME=CUDA 2024-01-31T22:14:22.1337379Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-1 2024-01-31T22:14:22.1337870Z GITHUB_REF=refs/tags/ciflow/slow/118586 2024-01-31T22:14:22.1338314Z NV_CUDA_CUDART_VERSION=12.1.105-1 2024-01-31T22:14:22.1338689Z SHARD_NUMBER=1 2024-01-31T22:14:22.1339006Z GITHUB_REF_PROTECTED=false 2024-01-31T22:14:22.1339349Z HOME=/var/lib/jenkins 2024-01-31T22:14:22.1339714Z GITHUB_API_URL=https://api.github.com 2024-01-31T22:14:22.1340153Z PYTORCH_TEST_RERUN_DISABLED_TESTS=0 2024-01-31T22:14:22.1340625Z UCX_COMMIT=00bcc6bb18fc282eb160623b4c0d300147f579af 2024-01-31T22:14:22.1341101Z SCCACHE_S3_KEY_PREFIX=slow 2024-01-31T22:14:22.1341453Z CUDA_VERSION=12.1.1 2024-01-31T22:14:22.1341864Z NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1.3.1-1 2024-01-31T22:14:22.1342311Z NUM_TEST_SHARDS=4 2024-01-31T22:14:22.1342607Z UCX_HOME=/usr 2024-01-31T22:14:22.1343104Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-1=12.1.1-1 2024-01-31T22:14:22.1344173Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1345474Z JOB_NAME=linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 1, 4, linux.g5.4xlarge.nvidia.gpu) 2024-01-31T22:14:22.1346753Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1347850Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2024-01-31T22:14:22.1348514Z GITHUB_EVENT_NAME=push 2024-01-31T22:14:22.1348841Z DASHBOARD_TAG= 2024-01-31T22:14:22.1349134Z GITHUB_RUN_ID=7731552918 2024-01-31T22:14:22.1349582Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12.1.0.40-1 2024-01-31T22:14:22.1350116Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1 2024-01-31T22:14:22.1351143Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1352041Z GITHUB_ACTOR=pytorch-bot[bot] 2024-01-31T22:14:22.1352458Z NV_LIBNPP_DEV_VERSION=12.1.0.40-1 2024-01-31T22:14:22.1352830Z PR_NUMBER= 2024-01-31T22:14:22.1353110Z GITHUB_RUN_ATTEMPT=2 2024-01-31T22:14:22.1353432Z VALGRIND=ON 2024-01-31T22:14:22.1353728Z ANACONDA_PYTHON_VERSION=3.10 2024-01-31T22:14:22.1354172Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2024-01-31T22:14:22.1354632Z TERM=xterm 2024-01-31T22:14:22.1355024Z NV_LIBCUSPARSE_DEV_VERSION=12.1.0.106-1 2024-01-31T22:14:22.1355441Z INSTALLED_VISION=yes 2024-01-31T22:14:22.1355751Z BRANCH= 2024-01-31T22:14:22.1356022Z OPENSSL_ROOT_DIR=/opt/openssl 2024-01-31T22:14:22.1356425Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2024-01-31T22:14:22.1356848Z CUDA_PATH=/usr/local/cuda 2024-01-31T22:14:22.1357587Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2024-01-31T22:14:22.1358329Z GITHUB_SERVER_URL=https://github.com 2024-01-31T22:14:22.1358823Z UCC_COMMIT=7cb07a76ccedad7e56ceb136b865eb9319c258ea 2024-01-31T22:14:22.1359290Z REENABLED_ISSUES= 2024-01-31T22:14:22.1359582Z SHLVL=1 2024-01-31T22:14:22.1359845Z MAX_JOBS=14 2024-01-31T22:14:22.1360146Z NV_CUDA_LIB_VERSION=12.1.1-1 2024-01-31T22:14:22.1360513Z NVARCH=x86_64 2024-01-31T22:14:22.1360845Z GITHUB_ACTOR_ID=54816060 2024-01-31T22:14:22.1361298Z GITHUB_WORKFLOW_SHA=86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-01-31T22:14:22.1361836Z GITHUB_REF_NAME=ciflow/slow/118586 2024-01-31T22:14:22.1362307Z NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1 2024-01-31T22:14:22.1362956Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2024-01-31T22:14:22.1363524Z GITHUB_JOB=test 2024-01-31T22:14:22.1363922Z NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1 2024-01-31T22:14:22.1364471Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2024-01-31T22:14:22.1365195Z NO_TEST_TIMEOUT=False 2024-01-31T22:14:22.1365584Z NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.1-1 2024-01-31T22:14:22.1366066Z GITHUB_REPOSITORY=pytorch/pytorch 2024-01-31T22:14:22.1366480Z NV_NVPROF_VERSION=12.1.105-1 2024-01-31T22:14:22.1366934Z GITHUB_RETENTION_DAYS=90 2024-01-31T22:14:22.1367277Z OPENSSL_DIR=/opt/openssl 2024-01-31T22:14:22.1367630Z GITHUB_ACTION_REPOSITORY= 2024-01-31T22:14:22.1368662Z PATH=/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-01-31T22:14:22.1369731Z GITHUB_BASE_REF= 2024-01-31T22:14:22.1370059Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2024-01-31T22:14:22.1370424Z CI=true 2024-01-31T22:14:22.1370731Z NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1 2024-01-31T22:14:22.1371147Z GITHUB_REPOSITORY_OWNER=pytorch 2024-01-31T22:14:22.1371522Z JOB_ID=21083466405 2024-01-31T22:14:22.1371828Z INSTALLED_PROTOBUF=yes 2024-01-31T22:14:22.1372152Z GITHUB_HEAD_REF= 2024-01-31T22:14:22.1372453Z GITHUB_ACTION_REF= 2024-01-31T22:14:22.1372869Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2024-01-31T22:14:22.1373331Z GITHUB_WORKFLOW=slow 2024-01-31T22:14:22.1373674Z DEBIAN_FRONTEND=noninteractive 2024-01-31T22:14:22.1374569Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_5800cb92-7647-46c8-9986-5de1099ac23c 2024-01-31T22:14:22.1375405Z SKIP_SCCACHE_INITIALIZATION=1 2024-01-31T22:14:22.1375769Z _=/usr/bin/env 2024-01-31T22:14:22.1376097Z + echo 'Testing pytorch' 2024-01-31T22:14:22.1376437Z Testing pytorch 2024-01-31T22:14:22.1376760Z + export LANG=C.UTF-8 2024-01-31T22:14:22.1377087Z + LANG=C.UTF-8 2024-01-31T22:14:22.1377389Z + PR_NUMBER= 2024-01-31T22:14:22.1377699Z + [[ default == \d\e\f\a\u\l\t ]] 2024-01-31T22:14:22.1378094Z + export CUDA_VISIBLE_DEVICES=0 2024-01-31T22:14:22.1378481Z + CUDA_VISIBLE_DEVICES=0 2024-01-31T22:14:22.1378841Z + export HIP_VISIBLE_DEVICES=0 2024-01-31T22:14:22.1379206Z + HIP_VISIBLE_DEVICES=0 2024-01-31T22:14:22.1379572Z + [[ default == \d\i\s\t\r\i\b\u\t\e\d ]] 2024-01-31T22:14:22.1379995Z + [[ default == \s\l\o\w ]] 2024-01-31T22:14:22.1380613Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *slow-gradcheck* ]] 2024-01-31T22:14:22.1381270Z + export PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 2024-01-31T22:14:22.1381738Z + PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 2024-01-31T22:14:22.1382179Z + export PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 2024-01-31T22:14:22.1382632Z + PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 2024-01-31T22:14:22.1383235Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *cuda* ]] 2024-01-31T22:14:22.1383920Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2024-01-31T22:14:22.1384391Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2024-01-31T22:14:22.1384813Z + [[ default == *crossref* ]] 2024-01-31T22:14:22.1385377Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *rocm* ]] 2024-01-31T22:14:22.1386098Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *xpu* ]] 2024-01-31T22:14:22.1386838Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck != *-bazel-* ]] 2024-01-31T22:14:22.1387446Z + pip_install --user ninja==1.10.2 2024-01-31T22:14:22.1387971Z + pip install --progress-bar off --user ninja==1.10.2 2024-01-31T22:14:22.4992817Z Collecting ninja==1.10.2 2024-01-31T22:14:22.5386809Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 2024-01-31T22:14:23.2710214Z Installing collected packages: ninja 2024-01-31T22:14:23.2773090Z  WARNING: The script ninja is installed in '/var/lib/jenkins/.local/bin' which is not on PATH. 2024-01-31T22:14:23.2774369Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2024-01-31T22:14:23.2817299Z Successfully installed ninja-1.10.2 2024-01-31T22:14:23.3483517Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-01-31T22:14:23.3486341Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-01-31T22:14:23.3488208Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *asan* ]] 2024-01-31T22:14:23.3488979Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *-debug* ]] 2024-01-31T22:14:23.3489727Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck != *-bazel-* ]] 2024-01-31T22:14:23.3490770Z + echo 'We are not in debug mode: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck. Expect the assertion to pass' 2024-01-31T22:14:23.3492007Z We are not in debug mode: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck. Expect the assertion to pass 2024-01-31T22:14:23.3492753Z + cd test 2024-01-31T22:14:23.3493265Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2024-01-31T22:14:24.5581232Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2024-01-31T22:14:24.5581766Z + [[ default == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2024-01-31T22:14:24.5582210Z + DYNAMO_BENCHMARK_FLAGS=() 2024-01-31T22:14:24.5582611Z + [[ default == *dynamo_eager* ]] 2024-01-31T22:14:24.5583019Z + [[ default == *aot_eager* ]] 2024-01-31T22:14:24.5583529Z + [[ default == *aot_inductor* ]] 2024-01-31T22:14:24.5584016Z + [[ default == *inductor* ]] 2024-01-31T22:14:24.5584472Z + [[ default == *dynamic* ]] 2024-01-31T22:14:24.5584862Z + [[ default == *cpu_inductor* ]] 2024-01-31T22:14:24.5585419Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2024-01-31T22:14:24.5592719Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *tbb* ]] 2024-01-31T22:14:24.5605531Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *libtorch* ]] 2024-01-31T22:14:24.5606390Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *-bazel-* ]] 2024-01-31T22:14:24.5608398Z + cd test 2024-01-31T22:14:24.5608861Z + python -c 'import torch; print(torch.__config__.show())' 2024-01-31T22:14:25.6792792Z PyTorch built with: 2024-01-31T22:14:25.6793226Z - GCC 9.4 2024-01-31T22:14:25.6793668Z - C++ Version: 201703 2024-01-31T22:14:25.6794905Z - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications 2024-01-31T22:14:25.6796116Z - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01) 2024-01-31T22:14:25.6796783Z - OpenMP 201511 (a.k.a. OpenMP 4.5) 2024-01-31T22:14:25.6797287Z - LAPACK is enabled (usually provided by MKL) 2024-01-31T22:14:25.6797774Z - NNPACK is enabled 2024-01-31T22:14:25.6798154Z - CPU capability usage: AVX2 2024-01-31T22:14:25.6798551Z - CUDA Runtime 12.1 2024-01-31T22:14:25.6799247Z - NVCC architecture flags: -gencode;arch=compute_86,code=sm_86 2024-01-31T22:14:25.6799801Z - CuDNN 8.9.2 2024-01-31T22:14:25.6800115Z - Magma 2.6.1 2024-01-31T22:14:25.6807428Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Werror -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 2024-01-31T22:14:25.6814070Z 2024-01-31T22:14:25.9178763Z + cd test 2024-01-31T22:14:26.9153812Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2024-01-31T22:14:26.9155012Z ATen/Parallel: 2024-01-31T22:14:26.9155519Z at::get_num_threads() : 8 2024-01-31T22:14:26.9155979Z at::get_num_interop_threads() : 16 2024-01-31T22:14:26.9156436Z OpenMP 201511 (a.k.a. OpenMP 4.5) 2024-01-31T22:14:26.9156830Z omp_get_max_threads() : 8 2024-01-31T22:14:26.9157704Z Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications 2024-01-31T22:14:26.9158499Z mkl_get_max_threads() : 8 2024-01-31T22:14:26.9159125Z Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01) 2024-01-31T22:14:26.9159751Z std::thread::hardware_concurrency() : 16 2024-01-31T22:14:26.9160180Z Environment variables: 2024-01-31T22:14:26.9160527Z OMP_NUM_THREADS : [not set] 2024-01-31T22:14:26.9160888Z MKL_NUM_THREADS : [not set] 2024-01-31T22:14:26.9161258Z ATen parallel backend: OpenMP 2024-01-31T22:14:26.9161511Z 2024-01-31T22:14:27.1221287Z + [[ default == *backward* ]] 2024-01-31T22:14:27.1221731Z + [[ default == *xla* ]] 2024-01-31T22:14:27.1222218Z + [[ default == *executorch* ]] 2024-01-31T22:14:27.1222685Z + [[ default == \j\i\t\_\l\e\g\a\c\y ]] 2024-01-31T22:14:27.1223351Z + [[ linux-focal-cuda12.1-py3-gcc9-slow-gradcheck == *libtorch* ]] 2024-01-31T22:14:27.1223932Z + [[ default == distributed ]] 2024-01-31T22:14:27.1224309Z + [[ default == deploy ]] 2024-01-31T22:14:27.1224700Z + [[ default == *inductor_distributed* ]] 2024-01-31T22:14:27.1225131Z + [[ default == *huggingface* ]] 2024-01-31T22:14:27.1225515Z + [[ default == *timm* ]] 2024-01-31T22:14:27.1225868Z + [[ default == *torchbench* ]] 2024-01-31T22:14:27.1226246Z + [[ default == *inductor* ]] 2024-01-31T22:14:27.1226610Z + [[ default == *dynamo* ]] 2024-01-31T22:14:27.1226960Z + [[ default == *dynamo* ]] 2024-01-31T22:14:27.1227298Z + [[ 1 == 1 ]] 2024-01-31T22:14:27.1227613Z + [[ 4 -gt 1 ]] 2024-01-31T22:14:27.1227900Z + test_without_numpy 2024-01-31T22:14:27.1228244Z ++ dirname .ci/pytorch/test.sh 2024-01-31T22:14:27.1232984Z + pushd .ci/pytorch 2024-01-31T22:14:27.1233355Z ~/workspace/.ci/pytorch ~/workspace 2024-01-31T22:14:27.1234569Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');from unittest import TestCase;import torch;x=torch.randn(3,3);TestCase().assertRaises(RuntimeError, lambda: x.numpy())' 2024-01-31T22:14:28.0626740Z :1: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2024-01-31T22:14:28.2574287Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');import torch;print(torch.tensor([torch.tensor(0.), torch.tensor(1.)]))' 2024-01-31T22:14:29.2128430Z :1: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2024-01-31T22:14:29.2137093Z tensor([0., 1.]) 2024-01-31T22:14:29.4125688Z + [[ default == *dynamo* ]] 2024-01-31T22:14:29.4126559Z + popd 2024-01-31T22:14:29.4127234Z ~/workspace 2024-01-31T22:14:29.4127812Z + install_torchvision 2024-01-31T22:14:29.4128459Z + local orig_preload 2024-01-31T22:14:29.4129078Z + local commit 2024-01-31T22:14:29.4129686Z ++ get_pinned_commit vision 2024-01-31T22:14:29.4130449Z ++ cat .github/ci_commit_pins/vision.txt 2024-01-31T22:14:29.4139190Z + commit=315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:29.4139687Z + orig_preload= 2024-01-31T22:14:29.4140185Z + '[' -n '' ']' 2024-01-31T22:14:29.4141025Z + pip_install --no-use-pep517 --user git+https://github.com/pytorch/vision.git@315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:29.4142454Z + pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/vision.git@315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:29.7167198Z Collecting git+https://github.com/pytorch/vision.git@315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:29.7171700Z Cloning https://github.com/pytorch/vision.git (to revision 315f31527e720999eecbb986679b3177d4ed5e37) to /tmp/pip-req-build-jc2zbofw 2024-01-31T22:14:29.7188557Z Running command git clone --filter=blob:none --quiet https://github.com/pytorch/vision.git /tmp/pip-req-build-jc2zbofw 2024-01-31T22:14:31.2709531Z Running command git rev-parse -q --verify 'sha^315f31527e720999eecbb986679b3177d4ed5e37' 2024-01-31T22:14:31.2723662Z Running command git fetch -q https://github.com/pytorch/vision.git 315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:32.4088452Z Running command git checkout -q 315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:32.6467795Z Resolved https://github.com/pytorch/vision.git to commit 315f31527e720999eecbb986679b3177d4ed5e37 2024-01-31T22:14:34.4287613Z Preparing metadata (setup.py) ... [?25l- \ done 2024-01-31T22:14:34.4334367Z [?25hRequirement already satisfied: numpy in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.18.0a0+315f315) (1.21.2) 2024-01-31T22:14:34.4337095Z Requirement already satisfied: requests in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.18.0a0+315f315) (2.31.0) 2024-01-31T22:14:34.4340271Z Requirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.18.0a0+315f315) (2.3.0a0+git86d2ab3) 2024-01-31T22:14:34.4345618Z Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.18.0a0+315f315) (10.0.1) 2024-01-31T22:14:34.4510707Z Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from requests->torchvision==0.18.0a0+315f315) (3.3.2) 2024-01-31T22:14:34.4514940Z Requirement already satisfied: idna<4,>=2.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from requests->torchvision==0.18.0a0+315f315) (3.6) 2024-01-31T22:14:34.4521023Z Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from requests->torchvision==0.18.0a0+315f315) (1.26.18) 2024-01-31T22:14:34.4526481Z Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from requests->torchvision==0.18.0a0+315f315) (2023.11.17) 2024-01-31T22:14:34.4577428Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.18.0a0+315f315) (3.9.0) 2024-01-31T22:14:34.4582288Z Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.18.0a0+315f315) (4.9.0) 2024-01-31T22:14:34.4585437Z Requirement already satisfied: sympy in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.18.0a0+315f315) (1.12) 2024-01-31T22:14:34.4588543Z Requirement already satisfied: networkx in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.18.0a0+315f315) (2.8.8) 2024-01-31T22:14:34.4591684Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.18.0a0+315f315) (3.1.2) 2024-01-31T22:14:34.4594749Z Requirement already satisfied: fsspec in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.18.0a0+315f315) (2023.4.0) 2024-01-31T22:14:34.5154054Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->torchvision==0.18.0a0+315f315) (2.1.4) 2024-01-31T22:14:34.5310961Z Requirement already satisfied: mpmath>=0.19 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy->torch->torchvision==0.18.0a0+315f315) (1.3.0) 2024-01-31T22:14:34.5390076Z Building wheels for collected packages: torchvision 2024-01-31T22:15:45.8211817Z Building wheel for torchvision (setup.py) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done 2024-01-31T22:15:45.8239708Z [?25h Created wheel for torchvision: filename=torchvision-0.18.0a0+315f315-cp310-cp310-linux_x86_64.whl size=2069715 sha256=c66128ffde24bcc55c9999cebdd78e742fc157f4227c1345860a9167426ce5ab 2024-01-31T22:15:45.8242451Z Stored in directory: /var/lib/jenkins/.cache/pip/wheels/6e/31/53/4e04f238fd46af402e948eb59734fb20b06062565136f8da88 2024-01-31T22:15:45.8276444Z Successfully built torchvision 2024-01-31T22:15:46.4428691Z Installing collected packages: torchvision 2024-01-31T22:15:46.8167730Z Successfully installed torchvision-0.18.0a0+315f315 2024-01-31T22:15:46.9117478Z + '[' -n '' ']' 2024-01-31T22:15:46.9117846Z + test_python_shard 1 2024-01-31T22:15:46.9120166Z + [[ -z 4 ]] 2024-01-31T22:15:46.9121120Z + python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 1 4 --verbose 2024-01-31T22:15:46.9865077Z /var/lib/jenkins/workspace/test/run_test.py:21: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html 2024-01-31T22:15:46.9867293Z import pkg_resources 2024-01-31T22:15:48.2988409Z Excluding doctests Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.2991119Z Excluding test_meta Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.2993186Z Excluding test_hub Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.2995382Z Excluding test_fx Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.2996409Z Excluding test_decomp Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.2997666Z Excluding test_cpp_extensions_jit Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.2998896Z Excluding test_jit Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.3000032Z Excluding test_ops Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.3001194Z Excluding test_ops_jit Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.3002469Z Excluding dynamo/test_recompile_ux Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.3003794Z Excluding inductor/test_smoke Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.3005746Z Excluding test_quantization Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2024-01-31T22:15:48.3007428Z Downloading https://ossci-metrics.s3.amazonaws.com/slow-tests.json to /var/lib/jenkins/workspace/test/.pytorch-slow-tests.json 2024-01-31T22:15:48.3746072Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2024-01-31T22:15:48.4263555Z Ignoring disabled issues: [''] 2024-01-31T22:15:49.5502967Z Found test times from artifacts 2024-01-31T22:15:49.5702285Z Found test times from artifacts 2024-01-31T22:15:49.5785008Z Name: high_relevance 2024-01-31T22:15:49.5785458Z Parallel tests: 2024-01-31T22:15:49.5785783Z Serial tests: 2024-01-31T22:15:49.5786220Z inductor/test_torchinductor_dynamic_shapes 1/3 2024-01-31T22:15:49.5788834Z Name: probable_relevance 2024-01-31T22:15:49.5789260Z Parallel tests: 2024-01-31T22:15:49.5791666Z Serial tests: 2024-01-31T22:15:49.5792023Z dynamo/test_dynamic_shapes 1/1 2024-01-31T22:15:49.5792433Z functorch/test_ops 3/6 2024-01-31T22:15:49.5792792Z test_ops_gradients 1/10 2024-01-31T22:15:49.5793164Z test_ops_gradients 5/10 2024-01-31T22:15:49.5793531Z test_ops_gradients 9/10 2024-01-31T22:15:49.5793889Z test_transformers 1/3 2024-01-31T22:15:49.5794256Z test_ops_fwd_gradients 2/10 2024-01-31T22:15:49.5794648Z test_ops_fwd_gradients 6/10 2024-01-31T22:15:49.5795045Z test_ops_fwd_gradients 10/10 2024-01-31T22:15:49.5795441Z test_cuda_multigpu 1/1 2024-01-31T22:15:49.5795790Z test_nn 1/1 2024-01-31T22:15:49.5796357Z test_public_bindings 1/1 2024-01-31T22:15:49.5796741Z test_multiprocessing 1/1 2024-01-31T22:15:49.5797106Z test_modules 1/3 2024-01-31T22:15:49.5797427Z test_sparse_csr 2/3 2024-01-31T22:15:49.5797779Z inductor/test_cuda_repro 1/1 2024-01-31T22:15:49.5798149Z test_utils 1/1 2024-01-31T22:15:49.5798467Z inductor/test_max_autotune 1/1 2024-01-31T22:15:49.5798888Z inductor/test_cpu_repro 1/1 2024-01-31T22:15:49.5799290Z inductor/test_fused_attention 1/1 2024-01-31T22:15:49.5799694Z dynamo/test_modules 1/1 2024-01-31T22:15:49.5800057Z dynamo/test_functions 1/1 2024-01-31T22:15:49.5800432Z dynamo/test_minifier 1/1 2024-01-31T22:15:49.5800810Z inductor/test_inductor_freezing 1/1 2024-01-31T22:15:49.5801248Z inductor/test_triton_wrapper 1/1 2024-01-31T22:15:49.5801657Z export/test_export 1/1 2024-01-31T22:15:49.5802012Z test_python_dispatch 1/1 2024-01-31T22:15:49.5802390Z inductor/test_layout_optim 1/1 2024-01-31T22:15:49.5802783Z dynamo/test_unspec 1/1 2024-01-31T22:15:49.5803148Z test_tensor_creation_ops 1/1 2024-01-31T22:15:49.5803542Z export/test_passes 1/1 2024-01-31T22:15:49.5803902Z dynamo/test_subclasses 1/1 2024-01-31T22:15:49.5804273Z test_cpp_api_parity 1/1 2024-01-31T22:15:49.5804636Z dynamo/test_backends 1/1 2024-01-31T22:15:49.5805398Z dynamo/test_after_aot 1/1 2024-01-31T22:15:49.5805821Z inductor/test_config 1/1 2024-01-31T22:15:49.5806244Z inductor/test_coordinate_descent_tuner 1/1 2024-01-31T22:15:49.5806693Z test_schema_check 1/2 2024-01-31T22:15:49.5807040Z nn/test_convolution 3/3 2024-01-31T22:15:49.5807425Z test_model_exports_to_core_aten 1/1 2024-01-31T22:15:49.5807853Z dynamo/test_pre_dispatch 1/1 2024-01-31T22:15:49.5808238Z export/test_safeguard 1/1 2024-01-31T22:15:49.5808659Z torch_np/numpy_tests/lib/test_type_check 1/1 2024-01-31T22:15:49.5809121Z dynamo/test_comptime 1/1 2024-01-31T22:15:49.5809493Z nn/test_multihead_attention 1/1 2024-01-31T22:15:49.5809915Z dynamo/test_cudagraphs 1/1 2024-01-31T22:15:49.5810295Z Name: unranked_relevance 2024-01-31T22:15:49.5810634Z Parallel tests: 2024-01-31T22:15:49.5810944Z Serial tests: 2024-01-31T22:15:49.5811275Z inductor/test_cuda_cpp_wrapper 1/3 2024-01-31T22:15:49.5811711Z inductor/test_aot_inductor_utils 1/1 2024-01-31T22:15:49.5812136Z lazy/test_bindings 1/1 2024-01-31T22:15:49.5812644Z optim/test_optim 1/1 2024-01-31T22:15:49.5812990Z Name: unlikely_relevance 2024-01-31T22:15:49.5813339Z Parallel tests: 2024-01-31T22:15:49.5813647Z Serial tests: 2024-01-31T22:15:49.5813943Z Name: none_relevance 2024-01-31T22:15:49.5814264Z Parallel tests: 2024-01-31T22:15:49.5814567Z Serial tests: 2024-01-31T22:15:49.5815325Z Starting test batch 'high_relevance' 4.76837158203125e-07 seconds after initiating testing 2024-01-31T22:15:49.5816034Z With sharding, this batch will run 1 tests 2024-01-31T22:15:51.1308020Z Running inductor/test_torchinductor_dynamic_shapes 1/3 ... [2024-01-31 22:15:51.130299] 2024-01-31T22:15:51.1310294Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_dynamic_shapes.py', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 22:15:51.130630] 2024-01-31T22:26:21.3653956Z 2024-01-31T22:26:21.3657840Z PRINTING LOG FILE of inductor/test_torchinductor_dynamic_shapes 1/3 (test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.3_b89fba1a042240da_.log) 2024-01-31T22:26:21.3660284Z Test results will be stored in test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-13372485e47c0487.xml 2024-01-31T22:26:21.3661856Z ============================= test session starts ============================== 2024-01-31T22:26:21.3663110Z platform linux -- Python 3.10.13, pytest-7.3.2, pluggy-1.4.0 -- /opt/conda/envs/py_3.10/bin/python 2024-01-31T22:26:21.3663863Z cachedir: .pytest_cache 2024-01-31T22:26:21.3665236Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2024-01-31T22:26:21.3666169Z rootdir: /var/lib/jenkins/workspace 2024-01-31T22:26:21.3666625Z configfile: pytest.ini 2024-01-31T22:26:21.3667668Z plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0 2024-01-31T22:26:21.3668707Z collecting ... collected 988 items 2024-01-31T22:26:21.3669254Z stepcurrent: Cannot find last run test, not skipping 2024-01-31T22:26:21.3842844Z Running 325 items in this shard: test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_AllenaiLongformerBase_repro_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_abs_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_avg_pool1d_argmax_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_avg_pool2d1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_avg_pool2d_low_prec_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_add_complex2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_add_complex_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_alexnet_prefix_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_aliased_buffer_reuse_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_any_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_arange6_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_argmin2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_argmin3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_avg_pool2d2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_baddbmm_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_batch_norm_2d_2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_batch_norm_2d_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bernoulli1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bfloat16_to_int16_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bitwise_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bool_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_both_scalars_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bucketize_default_kwargs_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bucketize_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_builtins_round_int_ndigits_zero_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_empty_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_inplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_single_empty_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_compar_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_3d_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_fill_dtype_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_nd_inplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv2d_backward_channels_last_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv3d_channels_last_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv_functional_bn_fuse_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv_inference_heuristics_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv_with_as_strided_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_convolution2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cudnn_rnn_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cumprod_zero_dim_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_dist_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_div3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_div5_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_div6_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_dropout_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_dropout_trivial_0_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_empty2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_erfinv_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_exp_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_expm1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fft_real_input_real_output_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fill1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_flip_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float16_to_int16_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float32_to_int32_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float_index_expression_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fmod_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fmod_zero_dim_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_full_like_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_glu_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_grid_sampler_2d_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_hardsigmoid_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_hardswish_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_hardtanh_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_horizonal_fusion1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_propagation_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_propagation_remainder_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_failed_reinplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_fallback1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_fallback2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_indirect_load_broadcast_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_input_mutation1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_input_mutation3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_input_mutation5_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_invalid_operand_issue1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_isinf2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_issue102546_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_l1_loss_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_large_tensor_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_lgamma_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_rands2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_rands3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linear1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linear2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linspace1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linspace2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_list_clearing_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_log1p_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_log_fp64_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_long_tensor_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_masked_fill_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_masked_fill_promotion_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_min_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d5_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d6_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d_with_indices_backward2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d_with_indices_backward4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_mixed_mm2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_multilayer_var_lowp_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_neg_index_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_new_empty_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_new_ones_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_no_mega_fusion_during_lowering_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_no_op_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pow1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pow3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_profiler_mark_wrapper_call_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_rand_like_deterministic_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_randn_like_empty_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_randn_with_dtype_and_device_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_remove_noop_copy_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_round_correctness_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_round_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_rsqrt_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_rsqrt_dynamic_shapes_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scalar_input_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scaled_dot_product_attention_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter5_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter6_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_add1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_add3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_bf16_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_reduce1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scheduler_vertical_fusion1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sdpa_unaligned_mask_freezing_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_select_scatter_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sgn_extremal_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_shape_prop_torch_ones_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_silu_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_simplify_loops_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice_scatter5_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sqrt_dynamic_shapes_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_squeeze1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_stack_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sum1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_tensor1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_to_device_constant_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_to_dtype_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_transpose_add_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_transposed_propagates_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_triu_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_unroll_small_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_upsample_nearest2d_backward_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_var_mean_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_view_as_complex_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_view_detach_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_where_broadcast_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_adaptive_avg_pool1d_argmax_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_adaptive_avg_pool2d2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_complex2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_complex3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_complex_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_inplace_permuted_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_angle_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_any_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_arange3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_argmax_argmin_with_nan_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_as_strided_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d7_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d_backward3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d_backward_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_both_scalars_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_bucketize_add_autotune_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_batch_norm_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_copied_in_graph_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_use_after_remove_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_float_ndigits_zero_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cat_single_empty_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_compar_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_complex_fallback_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_computed_buffer_inlining_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_constant_pad_nd_inplace_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv2d_backward_channels_last_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv3d_channels_last_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_convolution3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_convolution4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cos_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cudnn_rnn_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumprod_zero_dim_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumsum_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumsum_pattern_matcher_issue_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumsum_zero_dim_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_custom_op_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dist_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_div1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_div5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_trivial_0_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dtype_sympy_expr_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_embedding_bag_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_empty2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_exp2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_expand_as_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_expand_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_expm1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_fill1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_float16_to_int16_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_float_index_expression_type_promotion_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_fmod_zero_dim_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_full_like_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_functionalize_rng_wrappers_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_gather_scatter_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_glu_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_hardswish_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_horizonal_fusion2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_propagation_remainder_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_put_fallback1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_indirect_load_broadcast_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_activations_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_mixed_dtype_ops_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_where_pointwise_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_int_input_dynamic_shapes_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_isinf_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_l1_loss_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_broadcast_reduction_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_pointwise_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_tensor_reduction_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_lgamma_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_like_channels_last_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_linear_float64_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_linspace1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_log2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_logsumexp_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_masked_fill_promotion_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d_with_indices_backward4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d_with_indices_backward5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_mixed_mm2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multi_device_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multi_gpu_device_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multilayer_any_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multilayer_var_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_mutations_loop_fusion_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_narrow_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_nll_loss_forward_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_permute2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pow1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pow2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_rand_like_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_randint_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_relu_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_repeat_interleave_2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_roi_align_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_round_correctness_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scalar_input_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter_reduce1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sdpa_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sdpa_unaligned_mask_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sin_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sizehint_issue1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_mutation2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_scatter5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_view_with_graph_break_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_softmax_one_kernel_loop_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_split_with_sizes_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sqrt_dynamic_shapes_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_stack_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum_dtype_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum_int_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_tan_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_tensor_index_put_slice_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_device_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_dtype_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_memory_format_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_topk_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_transpose_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_transposed_propagates_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_uint4x2_mixed_mm_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_unfold_zero_dimension_tensor_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_bicubic2d_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_cat_conv_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_nearest2d_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_nearest3d_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_view_as_complex_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_view_on_aliased_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_xblock_divides_xnumel_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_zero_element_mutation_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_arithmetic_constant_folding_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_bool_mask_nobreak_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_dynamic_stride_nobreak_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_floor_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_full_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_bool_nobreak_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_materialize_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_nobreak_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op0_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op2_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op3_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op5_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op6_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_nonzero_size_factory_nobreak_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_slice_index_changing_sign_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_reduction_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unwrap_storage_didnt_work_repro_cuda 2024-01-31T22:26:21.4012552Z 2024-01-31T22:26:21.4014402Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_AllenaiLongformerBase_repro_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py [2024-01-31 22:15:57,033] [0/0] torch._dynamo.variables.torch: [WARNING] Calling on only torch.SymInt arguments is not yet supported. 2024-01-31T22:26:21.4016963Z [2024-01-31 22:15:57,033] [0/0] torch._dynamo.variables.torch: [WARNING] To support this behavior, we need to allow const-propping tensors that store symint data. 2024-01-31T22:26:21.4018653Z [2024-01-31 22:15:57,033] [0/0] torch._dynamo.variables.torch: [WARNING] For now, dynamo will explicitly graph break when it encounters user code with this behavior. 2024-01-31T22:26:21.4019874Z [2024-01-31 22:15:57,033] [0/0] torch._dynamo.variables.torch: [WARNING] 2024-01-31T22:26:21.4020554Z XFAIL [0.4016s] [ 0%] 2024-01-31T22:26:21.4021572Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_abs_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7266s] [ 0%] 2024-01-31T22:26:21.4023333Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_avg_pool1d_argmax_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.9235s] [ 0%] 2024-01-31T22:26:21.4025253Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_avg_pool2d1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [8.2555s] [ 1%] 2024-01-31T22:26:21.4027398Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_avg_pool2d_low_prec_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.2364s] [ 1%] 2024-01-31T22:26:21.4029536Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_add_complex2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5607s] [ 1%] 2024-01-31T22:26:21.4031536Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_add_complex_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5419s] [ 2%] 2024-01-31T22:26:21.4033558Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_alexnet_prefix_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [6.0516s] [ 2%] 2024-01-31T22:26:21.4035637Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_aliased_buffer_reuse_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6273s] [ 2%] 2024-01-31T22:26:21.4037639Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_any_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.3121s] [ 3%] 2024-01-31T22:26:21.4039627Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_arange6_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.9645s] [ 3%] 2024-01-31T22:26:21.4041611Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_argmin2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5845s] [ 3%] 2024-01-31T22:26:21.4043705Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_argmin3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py SKIPPED [0.0002s] ( 2024-01-31T22:26:21.4045248Z FIXME: In the case of having equally max/min elements, our implementation returns 2024-01-31T22:26:21.4046057Z the last index instead of the first one) [ 4%] 2024-01-31T22:26:21.4047336Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_avg_pool2d2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.6135s] [ 4%] 2024-01-31T22:26:21.4049037Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_baddbmm_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.1757s] [ 4%] 2024-01-31T22:26:21.4050843Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_batch_norm_2d_2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py SKIPPED [0.0008s] (requires cuda) [ 4%] 2024-01-31T22:26:21.4052666Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_batch_norm_2d_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.8676s] [ 5%] 2024-01-31T22:26:21.4054379Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bernoulli1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5111s] [ 5%] 2024-01-31T22:26:21.4056105Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bfloat16_to_int16_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5234s] [ 5%] 2024-01-31T22:26:21.4057820Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bitwise_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5475s] [ 6%] 2024-01-31T22:26:21.4059485Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bool_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6263s] [ 6%] 2024-01-31T22:26:21.4061177Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_both_scalars_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5505s] [ 6%] 2024-01-31T22:26:21.4063037Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bucketize_default_kwargs_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0436s] [ 7%] 2024-01-31T22:26:21.4064805Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bucketize_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.1980s] [ 7%] 2024-01-31T22:26:21.4066605Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_builtins_round_int_ndigits_zero_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4974s] [ 7%] 2024-01-31T22:26:21.4068392Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_empty_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.2134s] [ 8%] 2024-01-31T22:26:21.4070106Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_inplace_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6798s] [ 8%] 2024-01-31T22:26:21.4071849Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_single_empty_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0246s] [ 8%] 2024-01-31T22:26:21.4073564Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_compar_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5289s] [ 8%] 2024-01-31T22:26:21.4075275Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_3d_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.4523s] [ 9%] 2024-01-31T22:26:21.4077055Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_fill_dtype_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6051s] [ 9%] 2024-01-31T22:26:21.4078940Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_nd_inplace_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5161s] [ 9%] 2024-01-31T22:26:21.4080791Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv2d_backward_channels_last_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.2497s] [ 10%] 2024-01-31T22:26:21.4082628Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv3d_channels_last_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.2908s] [ 10%] 2024-01-31T22:26:21.4084423Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv_functional_bn_fuse_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.3163s] [ 10%] 2024-01-31T22:26:21.4086498Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv_inference_heuristics_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py SKIPPED [0.0007s] (cuda only test) [ 11%] 2024-01-31T22:26:21.4089266Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_conv_with_as_strided_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] failed to eagerly compile backwards for dynamic, suppressing in case backwards not needed 2024-01-31T22:26:21.4091601Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] Traceback (most recent call last): 2024-01-31T22:26:21.4093442Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1785, in compile_file 2024-01-31T22:26:21.4095326Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] subprocess.check_output(cmd, stderr=subprocess.STDOUT) 2024-01-31T22:26:21.4097096Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/subprocess.py", line 421, in check_output 2024-01-31T22:26:21.4098941Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, 2024-01-31T22:26:21.4100690Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/subprocess.py", line 526, in run 2024-01-31T22:26:21.4102348Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] raise CalledProcessError(retcode, process.args, 2024-01-31T22:26:21.4110170Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] subprocess.CalledProcessError: Command '['g++', '/tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp', '-shared', '-fPIC', '-Wall', '-std=c++17', '-Wno-unused-variable', '-Wno-unknown-pragmas', '-D_GLIBCXX_USE_CXX11_ABI=1', '-I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include', '-I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include', '-I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/TH', '-I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THC', '-I/opt/conda/envs/py_3.10/include/python3.10', '-L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib', '-L/opt/conda/envs/py_3.10/lib', '-L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib', '-ltorch', '-ltorch_cpu', '-lgomp', '-ltorch_python', '-lc10', '-mavx2', '-mfma', '-DCPU_CAPABILITY_AVX2', '-O3', '-DNDEBUG', '-ffast-math', '-fno-finite-math-only', '-fno-unsafe-math-optimizations', '-ffp-contract=off', '-march=native', '-fopenmp', '-D', 'C10_USING_CUSTOM_GENERATED_MACROS', '-o', '/tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.so']' returned non-zero exit status 1. 2024-01-31T22:26:21.4117398Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 2024-01-31T22:26:21.4118874Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] The above exception was the direct cause of the following exception: 2024-01-31T22:26:21.4120240Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 2024-01-31T22:26:21.4121462Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] Traceback (most recent call last): 2024-01-31T22:26:21.4123503Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 369, in aot_dispatch_autograd 2024-01-31T22:26:21.4125701Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] compiled_bw_func = aot_config.bw_compiler( 2024-01-31T22:26:21.4127645Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 35, in _wrapped_bw_compiler 2024-01-31T22:26:21.4129636Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return disable(disable(bw_compiler)(*args, **kwargs)) 2024-01-31T22:26:21.4131487Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 443, in _fn 2024-01-31T22:26:21.4133181Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return fn(*args, **kwargs) 2024-01-31T22:26:21.4134959Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 25, in inner 2024-01-31T22:26:21.4136746Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return fn(*args, **kwargs) 2024-01-31T22:26:21.4138499Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 248, in time_wrapper 2024-01-31T22:26:21.4140202Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] r = func(*args, **kwargs) 2024-01-31T22:26:21.4141988Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1177, in bw_compiler 2024-01-31T22:26:21.4143714Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return inner_compile( 2024-01-31T22:26:21.4145508Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper 2024-01-31T22:26:21.4147385Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] inner_compiled_fn = compiler_fn(gm, example_inputs) 2024-01-31T22:26:21.4149262Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/debug.py", line 304, in inner 2024-01-31T22:26:21.4150938Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return fn(*args, **kwargs) 2024-01-31T22:26:21.4152591Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2024-01-31T22:26:21.4154147Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return func(*args, **kwds) 2024-01-31T22:26:21.4155731Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2024-01-31T22:26:21.4157283Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return func(*args, **kwds) 2024-01-31T22:26:21.4159153Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 345, in compile_fx_inner 2024-01-31T22:26:21.4160986Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] compiled_graph = fx_codegen_and_compile( 2024-01-31T22:26:21.4162913Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 582, in fx_codegen_and_compile 2024-01-31T22:26:21.4164762Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] compiled_fn = graph.compile_to_fn() 2024-01-31T22:26:21.4166738Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1181, in compile_to_fn 2024-01-31T22:26:21.4168528Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return self.compile_to_module().call 2024-01-31T22:26:21.4170340Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 248, in time_wrapper 2024-01-31T22:26:21.4172160Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] r = func(*args, **kwargs) 2024-01-31T22:26:21.4174258Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1133, in compile_to_module 2024-01-31T22:26:21.4176326Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] mod = PyCodeCache.load_by_key_path( 2024-01-31T22:26:21.4178474Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2046, in load_by_key_path 2024-01-31T22:26:21.4180576Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] exec(code, mod.__dict__, mod.__dict__) 2024-01-31T22:26:21.4182753Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/tmp/torchinductor_jenkins/df/cdfchrfgxr2es6cevnhdyaef5dj7u4xkp6ifnrwmqi5nyog5ch74.py", line 164, in 2024-01-31T22:26:21.4184821Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] async_compile.wait(globals()) 2024-01-31T22:26:21.4186850Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2585, in wait 2024-01-31T22:26:21.4188890Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] scope[key] = result.result() 2024-01-31T22:26:21.4190836Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/_base.py", line 458, in result 2024-01-31T22:26:21.4192706Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] return self.__get_result() 2024-01-31T22:26:21.4194673Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result 2024-01-31T22:26:21.4196553Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] raise self._exception 2024-01-31T22:26:21.4198435Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2024-01-31T22:26:21.4200394Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] result = self.fn(*self.args, **self.kwargs) 2024-01-31T22:26:21.4202575Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1980, in load_pybinding 2024-01-31T22:26:21.4204694Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] result = cls.load(source_code + suffix, cuda) 2024-01-31T22:26:21.4206683Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1860, in load 2024-01-31T22:26:21.4208471Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] compile_file(input_path, output_path, cmd) 2024-01-31T22:26:21.4210364Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 248, in time_wrapper 2024-01-31T22:26:21.4212071Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] r = func(*args, **kwargs) 2024-01-31T22:26:21.4213859Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1800, in compile_file 2024-01-31T22:26:21.4215696Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] raise exc.CppCompileError(cmd, output) from e 2024-01-31T22:26:21.4217253Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] torch._inductor.exc.CppCompileError: C++ compile error 2024-01-31T22:26:21.4218559Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 2024-01-31T22:26:21.4219663Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] Command: 2024-01-31T22:26:21.4226703Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] g++ /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=1 -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/envs/py_3.10/include/python3.10 -L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib -L/opt/conda/envs/py_3.10/lib -L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib -ltorch -ltorch_cpu -lgomp -ltorch_python -lc10 -mavx2 -mfma -DCPU_CAPABILITY_AVX2 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D C10_USING_CUSTOM_GENERATED_MACROS -o /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.so 2024-01-31T22:26:21.4233512Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 2024-01-31T22:26:21.4234626Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] Output: 2024-01-31T22:26:21.4236363Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] In file included from /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp:2: 2024-01-31T22:26:21.4240947Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kf/ckfqpz6yp2sujhwvtvlb2vb43nqje6bvriedz3vj5dms52hfmvis.h: In function ‘at::vec::CPU_CAPABILITY::Vectorized to_float_mask(int)’: 2024-01-31T22:26:21.4243900Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kf/ckfqpz6yp2sujhwvtvlb2vb43nqje6bvriedz3vj5dms52hfmvis.h:420:4: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 2024-01-31T22:26:21.4246461Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 420 | *(uint32_t*)&mask = src ? 0xFFFFFFFF : 0; 2024-01-31T22:26:21.4247864Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] | ^~~~~~~~~~~~~~~~ 2024-01-31T22:26:21.4250138Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp: In function ‘void kernel(const float*, long int*, float*, float*, float*, float*, float*, long int, long int)’: 2024-01-31T22:26:21.4253118Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp:92:64: error: invalid type argument of unary ‘*’ (have ‘int’) 2024-01-31T22:26:21.4255346Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 92 | auto tmp1 = decltype(tmp0)(tmp0 + 384*ks0*ks1**2 + 3072*ks0*ks1 + 6144*ks0); 2024-01-31T22:26:21.4256991Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] | ^ 2024-01-31T22:26:21.4258902Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] In file included from /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp:2: 2024-01-31T22:26:21.4262097Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kf/ckfqpz6yp2sujhwvtvlb2vb43nqje6bvriedz3vj5dms52hfmvis.h: In instantiation of ‘typename std::enable_if<(! std::is_integral<_Tp>::value)>::type atomic_add(volatile T*, T) [with T = float; typename std::enable_if<(! std::is_integral<_Tp>::value)>::type = void]’: 2024-01-31T22:26:21.4265105Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kt/cktdfj4vhgmuvl4rrcf7zdlcm6l3t35kgkoirfwcupeshe7rtg7c.cpp:42:68: required from here 2024-01-31T22:26:21.4267788Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kf/ckfqpz6yp2sujhwvtvlb2vb43nqje6bvriedz3vj5dms52hfmvis.h:227:28: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 2024-01-31T22:26:21.4270110Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 227 | reinterpret_cast(&expected)[0] = val; 2024-01-31T22:26:21.4271567Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] | ^~~~~~~~ 2024-01-31T22:26:21.4273813Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] /tmp/torchinductor_jenkins/kf/ckfqpz6yp2sujhwvtvlb2vb43nqje6bvriedz3vj5dms52hfmvis.h:228:28: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] 2024-01-31T22:26:21.4276088Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 228 | reinterpret_cast(&desired)[0] = val + offset; 2024-01-31T22:26:21.4277580Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] | ^~~~~~~ 2024-01-31T22:26:21.4278848Z [2024-01-31 22:17:10,390] [48/0] torch._functorch._aot_autograd.jit_compile_runtime_wrappers: [WARNING] 2024-01-31T22:26:21.4279552Z PASSED [4.5904s] [ 11%] 2024-01-31T22:26:21.4280632Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_convolution2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.1266s] [ 11%] 2024-01-31T22:26:21.4282440Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cudnn_rnn_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py SKIPPED [0.0007s] (requires cuda) [ 12%] 2024-01-31T22:26:21.4284261Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cumprod_zero_dim_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5141s] [ 12%] 2024-01-31T22:26:21.4286393Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_dist_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7225s] [ 12%] 2024-01-31T22:26:21.4288468Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_div3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6950s] [ 12%] 2024-01-31T22:26:21.4290112Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_div5_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6616s] [ 13%] 2024-01-31T22:26:21.4291734Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_div6_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6985s] [ 13%] 2024-01-31T22:26:21.4293392Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_dropout_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6033s] [ 13%] 2024-01-31T22:26:21.4295132Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_dropout_trivial_0_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5113s] [ 14%] 2024-01-31T22:26:21.4296858Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_empty2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0207s] [ 14%] 2024-01-31T22:26:21.4298521Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_erfinv_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5306s] [ 14%] 2024-01-31T22:26:21.4300179Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_exp_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6889s] [ 15%] 2024-01-31T22:26:21.4301831Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_expm1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [8.2468s] [ 15%] 2024-01-31T22:26:21.4303650Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fft_real_input_real_output_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.1167s] [ 15%] 2024-01-31T22:26:21.4305386Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fill1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6938s] [ 16%] 2024-01-31T22:26:21.4307051Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_flip_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8074s] [ 16%] 2024-01-31T22:26:21.4308748Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float16_to_int16_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5078s] [ 16%] 2024-01-31T22:26:21.4310499Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float32_to_int32_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5258s] [ 16%] 2024-01-31T22:26:21.4312278Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float_index_expression_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0017s] [ 17%] 2024-01-31T22:26:21.4313989Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fmod_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7370s] [ 17%] 2024-01-31T22:26:21.4315671Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fmod_zero_dim_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.3596s] [ 17%] 2024-01-31T22:26:21.4317379Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_full_like_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5145s] [ 18%] 2024-01-31T22:26:21.4319045Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_glu_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.1210s] [ 18%] 2024-01-31T22:26:21.4320727Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_grid_sampler_2d_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [6.6527s] [ 18%] 2024-01-31T22:26:21.4322469Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_hardsigmoid_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6826s] [ 19%] 2024-01-31T22:26:21.4324276Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_hardswish_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7067s] [ 19%] 2024-01-31T22:26:21.4326208Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_hardtanh_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6483s] [ 19%] 2024-01-31T22:26:21.4327941Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_horizonal_fusion1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6331s] [ 20%] 2024-01-31T22:26:21.4329660Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7603s] [ 20%] 2024-01-31T22:26:21.4331398Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_propagation_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.0895s] [ 20%] 2024-01-31T22:26:21.4333229Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_propagation_remainder_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7151s] [ 20%] 2024-01-31T22:26:21.4335010Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7865s] [ 21%] 2024-01-31T22:26:21.4336722Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put4_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5419s] [ 21%] 2024-01-31T22:26:21.4338508Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_failed_reinplace_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5324s] [ 21%] 2024-01-31T22:26:21.4340418Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_fallback1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.1275s] [ 22%] 2024-01-31T22:26:21.4342215Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_fallback2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6363s] [ 22%] 2024-01-31T22:26:21.4344016Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_indirect_load_broadcast_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8488s] [ 22%] 2024-01-31T22:26:21.4345807Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_input_mutation1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5555s] [ 23%] 2024-01-31T22:26:21.4347559Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_input_mutation3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7028s] [ 23%] 2024-01-31T22:26:21.4349369Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_input_mutation5_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4916s] [ 23%] 2024-01-31T22:26:21.4351154Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_invalid_operand_issue1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.2131s] [ 24%] 2024-01-31T22:26:21.4352898Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_isinf2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5065s] [ 24%] 2024-01-31T22:26:21.4354602Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_issue102546_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4785s] [ 24%] 2024-01-31T22:26:21.4356294Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_l1_loss_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6394s] [ 24%] 2024-01-31T22:26:21.4358160Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_large_tensor_reduction_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py SKIPPED [0.0008s] (Fails on CPU) [ 25%] 2024-01-31T22:26:21.4360069Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_lgamma_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6984s] [ 25%] 2024-01-31T22:26:21.4361775Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_rands2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4955s] [ 25%] 2024-01-31T22:26:21.4363489Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_rands3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5986s] [ 26%] 2024-01-31T22:26:21.4365391Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linear1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.3232s] [ 26%] 2024-01-31T22:26:21.4367096Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linear2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.6505s] [ 26%] 2024-01-31T22:26:21.4368802Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linspace1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5367s] [ 27%] 2024-01-31T22:26:21.4370503Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linspace2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0276s] [ 27%] 2024-01-31T22:26:21.4372218Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_list_clearing_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6711s] [ 27%] 2024-01-31T22:26:21.4373914Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_log1p_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [8.2579s] [ 28%] 2024-01-31T22:26:21.4375683Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_log_fp64_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5125s] [ 28%] 2024-01-31T22:26:21.4377370Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_long_tensor_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5309s] [ 28%] 2024-01-31T22:26:21.4379095Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_masked_fill_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5874s] [ 28%] 2024-01-31T22:26:21.4380850Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_masked_fill_promotion_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.0728s] [ 29%] 2024-01-31T22:26:21.4382582Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_min_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6080s] [ 29%] 2024-01-31T22:26:21.4384286Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d5_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.4469s] [ 29%] 2024-01-31T22:26:21.4385989Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d6_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.2507s] [ 30%] 2024-01-31T22:26:21.4387785Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d_with_indices_backward2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [24.3332s] [ 30%] 2024-01-31T22:26:21.4389740Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d_with_indices_backward4_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [23.6311s] [ 30%] 2024-01-31T22:26:21.4391545Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_mixed_mm2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6124s] [ 31%] 2024-01-31T22:26:21.4393294Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_multilayer_var_lowp_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.6382s] [ 31%] 2024-01-31T22:26:21.4395042Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_neg_index_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [15.4558s] [ 31%] 2024-01-31T22:26:21.4396814Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_new_empty_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0334s] [ 32%] 2024-01-31T22:26:21.4398563Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_new_ones_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4906s] [ 32%] 2024-01-31T22:26:21.4400357Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_no_mega_fusion_during_lowering_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [2.4321s] [ 32%] 2024-01-31T22:26:21.4402168Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_no_op_reduction_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5376s] [ 32%] 2024-01-31T22:26:21.4403867Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pow1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8145s] [ 33%] 2024-01-31T22:26:21.4405778Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pow3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4850s] [ 33%] 2024-01-31T22:26:21.4408070Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_profiler_mark_wrapper_call_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py STAGE:2024-01-31 22:20:20 1553:1553 ActivityProfilerController.cpp:314] Completed Stage: Warm Up 2024-01-31T22:26:21.4409866Z STAGE:2024-01-31 22:20:22 1553:1553 ActivityProfilerController.cpp:320] Completed Stage: Collection 2024-01-31T22:26:21.4410979Z STAGE:2024-01-31 22:20:22 1553:1553 ActivityProfilerController.cpp:324] Completed Stage: Post Processing 2024-01-31T22:26:21.4411809Z PASSED [1.5608s] [ 33%] 2024-01-31T22:26:21.4412926Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_rand_like_deterministic_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5356s] [ 34%] 2024-01-31T22:26:21.4414717Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_randn_like_empty_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0614s] [ 34%] 2024-01-31T22:26:21.4416519Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_randn_with_dtype_and_device_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5680s] [ 34%] 2024-01-31T22:26:21.4418318Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_remove_noop_copy_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.2599s] [ 35%] 2024-01-31T22:26:21.4420081Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_round_correctness_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.4913s] [ 35%] 2024-01-31T22:26:21.4421783Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_round_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5832s] [ 35%] 2024-01-31T22:26:21.4423434Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_rsqrt_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5405s] [ 36%] 2024-01-31T22:26:21.4425161Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_rsqrt_dynamic_shapes_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6975s] [ 36%] 2024-01-31T22:26:21.4426909Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scalar_input_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5305s] [ 36%] 2024-01-31T22:26:21.4428756Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scaled_dot_product_attention_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6861s] [ 36%] 2024-01-31T22:26:21.4430528Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5912s] [ 37%] 2024-01-31T22:26:21.4432273Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter5_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6814s] [ 37%] 2024-01-31T22:26:21.4433953Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter6_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.0925s] [ 37%] 2024-01-31T22:26:21.4435824Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_add1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py SKIPPED [0.0002s] (Flaky test, needs debugging) [ 38%] 2024-01-31T22:26:21.4437696Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_add3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.1612s] [ 38%] 2024-01-31T22:26:21.4439459Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_bf16_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [4.5666s] [ 38%] 2024-01-31T22:26:21.4441190Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_reduce1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0790s] [ 39%] 2024-01-31T22:26:21.4442991Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scheduler_vertical_fusion1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8372s] [ 39%] 2024-01-31T22:26:21.4444968Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sdpa_unaligned_mask_freezing_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.2324s] [ 39%] 2024-01-31T22:26:21.4446810Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_select_scatter_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7004s] [ 40%] 2024-01-31T22:26:21.4448676Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sgn_extremal_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5233s] [ 40%] 2024-01-31T22:26:21.4450452Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_shape_prop_torch_ones_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.9525s] [ 40%] 2024-01-31T22:26:21.4452182Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_silu_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6907s] [ 40%] 2024-01-31T22:26:21.4453885Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_simplify_loops_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8102s] [ 41%] 2024-01-31T22:26:21.4455580Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.9251s] [ 41%] 2024-01-31T22:26:21.4457291Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice_scatter5_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8260s] [ 41%] 2024-01-31T22:26:21.4459111Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sqrt_dynamic_shapes_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5541s] [ 42%] 2024-01-31T22:26:21.4460845Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_squeeze1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5944s] [ 42%] 2024-01-31T22:26:21.4462513Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_stack_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5993s] [ 42%] 2024-01-31T22:26:21.4464167Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sum1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5454s] [ 43%] 2024-01-31T22:26:21.4465838Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_tensor1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5252s] [ 43%] 2024-01-31T22:26:21.4467627Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_to_device_constant_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5393s] [ 43%] 2024-01-31T22:26:21.4469339Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_to_dtype_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6135s] [ 44%] 2024-01-31T22:26:21.4471044Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_transpose_add_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6663s] [ 44%] 2024-01-31T22:26:21.4472818Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_transposed_propagates_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.5990s] [ 44%] 2024-01-31T22:26:21.4474554Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_triu_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8299s] [ 44%] 2024-01-31T22:26:21.4476289Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_unroll_small_reduction_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.5348s] [ 45%] 2024-01-31T22:26:21.4478151Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_upsample_nearest2d_backward_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [5.0678s] [ 45%] 2024-01-31T22:26:21.4479947Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_var_mean_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.8258s] [ 45%] 2024-01-31T22:26:21.4492569Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_view_as_complex_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.7797s] [ 46%] 2024-01-31T22:26:21.4494447Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_view_detach_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [0.0550s] [ 46%] 2024-01-31T22:26:21.4496169Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views1_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [36.4627s] [ 46%] 2024-01-31T22:26:21.4497842Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views2_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [9.7869s] [ 47%] 2024-01-31T22:26:21.4499502Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views3_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [1.6237s] [ 47%] 2024-01-31T22:26:21.4501177Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views4_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.0969s] [ 47%] 2024-01-31T22:26:21.4502882Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_where_broadcast_dynamic_shapes_cpu <- test/inductor/test_torchinductor.py PASSED [3.1080s] [ 48%] 2024-01-31T22:26:21.4504694Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_adaptive_avg_pool1d_argmax_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.4575s] [ 48%] 2024-01-31T22:26:21.4506529Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_adaptive_avg_pool2d2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0875s] [ 48%] 2024-01-31T22:26:21.4508300Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_complex2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0949s] [ 48%] 2024-01-31T22:26:21.4510031Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_complex3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5335s] [ 49%] 2024-01-31T22:26:21.4511764Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_complex_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9846s] [ 49%] 2024-01-31T22:26:21.4513543Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_add_inplace_permuted_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.2052s] [ 49%] 2024-01-31T22:26:21.4515358Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_angle_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.8394s] [ 50%] 2024-01-31T22:26:21.4517036Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_any_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.8823s] [ 50%] 2024-01-31T22:26:21.4518709Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_arange3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2566s] [ 50%] 2024-01-31T22:26:21.4520463Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_argmax_argmin_with_nan_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [9.3917s] [ 51%] 2024-01-31T22:26:21.4522243Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_as_strided_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.0038s] [ 51%] 2024-01-31T22:26:21.4523966Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.6056s] [ 51%] 2024-01-31T22:26:21.4525858Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.2038s] [ 52%] 2024-01-31T22:26:21.4527573Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d5_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9660s] [ 52%] 2024-01-31T22:26:21.4529301Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d7_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1804s] [ 52%] 2024-01-31T22:26:21.4531155Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d_backward3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.6325s] [ 52%] 2024-01-31T22:26:21.4532963Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d_backward_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.1355s] [ 53%] 2024-01-31T22:26:21.4534721Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_both_scalars_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1633s] [ 53%] 2024-01-31T22:26:21.4536514Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_bucketize_add_autotune_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.9744s] [ 53%] 2024-01-31T22:26:21.4538322Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_batch_norm_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.3975s] [ 54%] 2024-01-31T22:26:21.4540135Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_copied_in_graph_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5276s] [ 54%] 2024-01-31T22:26:21.4541964Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_use_after_remove_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.1069s] [ 54%] 2024-01-31T22:26:21.4543755Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3200s] [ 55%] 2024-01-31T22:26:21.4545594Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_float_ndigits_zero_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1397s] [ 55%] 2024-01-31T22:26:21.4547429Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cat_single_empty_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0514s] [ 55%] 2024-01-31T22:26:21.4549216Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_compar_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2313s] [ 56%] 2024-01-31T22:26:21.4551026Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_complex_fallback_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4327s] [ 56%] 2024-01-31T22:26:21.4552844Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_computed_buffer_inlining_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2837s] [ 56%] 2024-01-31T22:26:21.4554685Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_constant_pad_nd_inplace_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1209s] [ 56%] 2024-01-31T22:26:21.4556551Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv2d_backward_channels_last_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6795s] [ 57%] 2024-01-31T22:26:21.4558660Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv3d_channels_last_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py SKIPPED [0.0007s] (only support cpu conv3d channels_last) [ 57%] 2024-01-31T22:26:21.4560637Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_convolution3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.9906s] [ 57%] 2024-01-31T22:26:21.4562382Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_convolution4_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.8652s] [ 58%] 2024-01-31T22:26:21.4564081Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cos_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9942s] [ 58%] 2024-01-31T22:26:21.4565934Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cudnn_rnn_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.6037s] [ 58%] 2024-01-31T22:26:21.4567747Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumprod_zero_dim_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2311s] [ 59%] 2024-01-31T22:26:21.4569485Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumsum_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.7105s] [ 59%] 2024-01-31T22:26:21.4571272Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumsum_pattern_matcher_issue_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4754s] [ 59%] 2024-01-31T22:26:21.4573091Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cumsum_zero_dim_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1068s] [ 60%] 2024-01-31T22:26:21.4574812Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_custom_op_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6389s] [ 60%] 2024-01-31T22:26:21.4576513Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dist_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7338s] [ 60%] 2024-01-31T22:26:21.4578172Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_div1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0254s] [ 60%] 2024-01-31T22:26:21.4579824Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_div5_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7212s] [ 61%] 2024-01-31T22:26:21.4581507Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2706s] [ 61%] 2024-01-31T22:26:21.4583273Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.1719s] [ 61%] 2024-01-31T22:26:21.4585079Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_trivial_0_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2662s] [ 62%] 2024-01-31T22:26:21.4586920Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dtype_sympy_expr_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.3895s] [ 62%] 2024-01-31T22:26:21.4588740Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_embedding_bag_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1783s] [ 62%] 2024-01-31T22:26:21.4590457Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_empty2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.4982s] [ 63%] 2024-01-31T22:26:21.4592128Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_exp2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9416s] [ 63%] 2024-01-31T22:26:21.4593828Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_expand_as_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8955s] [ 63%] 2024-01-31T22:26:21.4595530Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_expand_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6879s] [ 64%] 2024-01-31T22:26:21.4597209Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_expm1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.2504s] [ 64%] 2024-01-31T22:26:21.4598926Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_fill1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8167s] [ 64%] 2024-01-31T22:26:21.4600644Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_float16_to_int16_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1408s] [ 64%] 2024-01-31T22:26:21.4602560Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_float_index_expression_type_promotion_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1768s] [ 65%] 2024-01-31T22:26:21.4604399Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_fmod_zero_dim_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6639s] [ 65%] 2024-01-31T22:26:21.4606265Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_full_like_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2501s] [ 65%] 2024-01-31T22:26:21.4608088Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_functionalize_rng_wrappers_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0287s] [ 66%] 2024-01-31T22:26:21.4609924Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_gather_scatter_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8247s] [ 66%] 2024-01-31T22:26:21.4611631Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_glu_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.7988s] [ 66%] 2024-01-31T22:26:21.4613318Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_hardswish_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7001s] [ 67%] 2024-01-31T22:26:21.4615067Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_horizonal_fusion2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7565s] [ 67%] 2024-01-31T22:26:21.4616793Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3581s] [ 67%] 2024-01-31T22:26:21.4618483Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.1914s] [ 68%] 2024-01-31T22:26:21.4620273Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_propagation_remainder_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2653s] [ 68%] 2024-01-31T22:26:21.4622221Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_put_fallback1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4707s] [ 68%] 2024-01-31T22:26:21.4624049Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_indirect_load_broadcast_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.2462s] [ 68%] 2024-01-31T22:26:21.4625867Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_activations_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.1498s] [ 69%] 2024-01-31T22:26:21.4627683Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_mixed_dtype_ops_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2287s] [ 69%] 2024-01-31T22:26:21.4629525Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_where_pointwise_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3065s] [ 69%] 2024-01-31T22:26:21.4631321Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1728s] [ 70%] 2024-01-31T22:26:21.4633085Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1860s] [ 70%] 2024-01-31T22:26:21.4634845Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation4_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1446s] [ 70%] 2024-01-31T22:26:21.4636646Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_int_input_dynamic_shapes_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7642s] [ 71%] 2024-01-31T22:26:21.4638520Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_isinf_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5082s] [ 71%] 2024-01-31T22:26:21.4640212Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_l1_loss_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4715s] [ 71%] 2024-01-31T22:26:21.4641990Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_broadcast_reduction_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3188s] [ 72%] 2024-01-31T22:26:21.4643790Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_pointwise_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9331s] [ 72%] 2024-01-31T22:26:21.4645739Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_tensor_reduction_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7149s] [ 72%] 2024-01-31T22:26:21.4647505Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_lgamma_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.6795s] [ 72%] 2024-01-31T22:26:21.4649250Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_like_channels_last_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5991s] [ 73%] 2024-01-31T22:26:21.4651575Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_linear_float64_dynamic_shapes_cuda <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/inductor_utils.py SKIPPED [0.0007s] (cuda failed for float64 linear) [ 73%] 2024-01-31T22:26:21.4653802Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_linspace1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3204s] [ 73%] 2024-01-31T22:26:21.4655501Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_log2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3925s] [ 74%] 2024-01-31T22:26:21.4657201Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_logsumexp_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2733s] [ 74%] 2024-01-31T22:26:21.4659037Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_masked_fill_promotion_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6753s] [ 74%] 2024-01-31T22:26:21.4660800Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.1798s] [ 75%] 2024-01-31T22:26:21.4662516Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d4_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.9972s] [ 75%] 2024-01-31T22:26:21.4664245Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d5_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.1794s] [ 75%] 2024-01-31T22:26:21.4666084Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d_with_indices_backward4_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [15.2382s] [ 76%] 2024-01-31T22:26:21.4668012Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d_with_indices_backward5_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3039s] [ 76%] 2024-01-31T22:26:21.4669857Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_mixed_mm2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0989s] [ 76%] 2024-01-31T22:26:21.4671936Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multi_device_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py [2024-01-31 22:24:29,703] [431/0] torch._inductor.utils: [WARNING] DeviceCopy in input program 2024-01-31T22:26:21.4673609Z [2024-01-31 22:24:29,705] [431/0] torch._inductor.utils: [WARNING] DeviceCopy in input program 2024-01-31T22:26:21.4674674Z [2024-01-31 22:24:29,708] [431/0] torch._inductor.utils: [WARNING] DeviceCopy in input program 2024-01-31T22:26:21.4675324Z PASSED [1.8357s] [ 76%] 2024-01-31T22:26:21.4676602Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multi_gpu_device_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py SKIPPED [0.0002s] (requires multiple cuda devices) [ 77%] 2024-01-31T22:26:21.4678585Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multilayer_any_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.2801s] [ 77%] 2024-01-31T22:26:21.4680338Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multilayer_var_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.6598s] [ 77%] 2024-01-31T22:26:21.4682128Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_mutations_loop_fusion_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2806s] [ 78%] 2024-01-31T22:26:21.4683882Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_narrow_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0257s] [ 78%] 2024-01-31T22:26:21.4685760Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_nll_loss_forward_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8722s] [ 78%] 2024-01-31T22:26:21.4687490Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_permute2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1803s] [ 79%] 2024-01-31T22:26:21.4689174Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pow1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.7530s] [ 79%] 2024-01-31T22:26:21.4690840Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pow2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8584s] [ 79%] 2024-01-31T22:26:21.4692608Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_rand_like_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4359s] [ 80%] 2024-01-31T22:26:21.4694441Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_randint_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6677s] [ 80%] 2024-01-31T22:26:21.4696146Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0048s] [ 80%] 2024-01-31T22:26:21.4697872Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9205s] [ 80%] 2024-01-31T22:26:21.4699589Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9160s] [ 81%] 2024-01-31T22:26:21.4701280Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_relu_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4194s] [ 81%] 2024-01-31T22:26:21.4703015Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_repeat_interleave_2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0666s] [ 81%] 2024-01-31T22:26:21.4704772Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_roi_align_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6875s] [ 82%] 2024-01-31T22:26:21.4706755Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_round_correctness_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py SKIPPED [0.0007s] (need to debug tl.libdevice on A100/V100) [ 82%] 2024-01-31T22:26:21.4708766Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scalar_input_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.2847s] [ 82%] 2024-01-31T22:26:21.4710554Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7646s] [ 83%] 2024-01-31T22:26:21.4712249Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter4_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2408s] [ 83%] 2024-01-31T22:26:21.4713992Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter_reduce1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5809s] [ 83%] 2024-01-31T22:26:21.4715708Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sdpa_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2620s] [ 84%] 2024-01-31T22:26:21.4717447Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sdpa_unaligned_mask_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3870s] [ 84%] 2024-01-31T22:26:21.4719217Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sin_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9525s] [ 84%] 2024-01-31T22:26:21.4720937Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sizehint_issue1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.8400s] [ 84%] 2024-01-31T22:26:21.4722659Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice1_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8701s] [ 85%] 2024-01-31T22:26:21.4724334Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7719s] [ 85%] 2024-01-31T22:26:21.4726204Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_mutation2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3592s] [ 85%] 2024-01-31T22:26:21.4727966Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_scatter5_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4407s] [ 86%] 2024-01-31T22:26:21.4729768Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_view_with_graph_break_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3159s] [ 86%] 2024-01-31T22:26:21.4731685Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_softmax_one_kernel_loop_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.3190s] [ 86%] 2024-01-31T22:26:21.4733795Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_split_with_sizes_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [5.4432s] [ 87%] 2024-01-31T22:26:21.4736136Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sqrt_dynamic_shapes_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py SKIPPED [0.0007s] (sqrt dynamic shapes only supports cpu) [ 87%] 2024-01-31T22:26:21.4738397Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_stack_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6905s] [ 87%] 2024-01-31T22:26:21.4740346Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum2_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.4184s] [ 88%] 2024-01-31T22:26:21.4742278Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum3_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5072s] [ 88%] 2024-01-31T22:26:21.4744236Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum_dtype_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5909s] [ 88%] 2024-01-31T22:26:21.4746213Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum_int_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5413s] [ 88%] 2024-01-31T22:26:21.4748256Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_tan_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.9061s] [ 89%] 2024-01-31T22:26:21.4750310Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_tensor_index_put_slice_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [6.2469s] [ 89%] 2024-01-31T22:26:21.4752799Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_device_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py [2024-01-31 22:25:17,254] [516/0] torch._inductor.utils: [WARNING] DeviceCopy in input program 2024-01-31T22:26:21.4754689Z [2024-01-31 22:25:17,316] [517/0] torch._inductor.utils: [WARNING] DeviceCopy in input program 2024-01-31T22:26:21.4755399Z PASSED [0.3267s] [ 89%] 2024-01-31T22:26:21.4756602Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_dtype_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4121s] [ 90%] 2024-01-31T22:26:21.4758678Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_memory_format_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.7509s] [ 90%] 2024-01-31T22:26:21.4760677Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_topk_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.1399s] [ 90%] 2024-01-31T22:26:21.4762652Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_transpose_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.1434s] [ 91%] 2024-01-31T22:26:21.4764726Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_transposed_propagates_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4125s] [ 91%] 2024-01-31T22:26:21.4766664Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_uint4x2_mixed_mm_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0086s] [ 91%] 2024-01-31T22:26:21.4768484Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_unfold_zero_dimension_tensor_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0291s] [ 92%] 2024-01-31T22:26:21.4770384Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_bicubic2d_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [5.8184s] [ 92%] 2024-01-31T22:26:21.4772394Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_cat_conv_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py SKIPPED [0.0007s] (only support cpu upsample_cat_conv test) [ 92%] 2024-01-31T22:26:21.4774373Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_nearest2d_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [3.6710s] [ 92%] 2024-01-31T22:26:21.4776152Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_nearest3d_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [4.7432s] [ 93%] 2024-01-31T22:26:21.4777917Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_view_as_complex_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3070s] [ 93%] 2024-01-31T22:26:21.4779674Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_view_on_aliased_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.8466s] [ 93%] 2024-01-31T22:26:21.4781453Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_xblock_divides_xnumel_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.4295s] [ 94%] 2024-01-31T22:26:21.4783254Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_zero_element_mutation_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.0925s] [ 94%] 2024-01-31T22:26:21.4784846Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_arithmetic_constant_folding_cuda SKIPPED [0.4530s] (Only runs on cpu) [ 94%] 2024-01-31T22:26:21.4786262Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_bool_mask_nobreak_cuda PASSED [1.0361s] [ 95%] 2024-01-31T22:26:21.4789323Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_dynamic_stride_nobreak_cuda SKIPPED [0.0004s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/117954 for platform(s) linux, slow, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) [ 95%] 2024-01-31T22:26:21.4791995Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_floor_cuda PASSED [0.4809s] [ 95%] 2024-01-31T22:26:21.4793103Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_full_cuda PASSED [1.9299s] [ 96%] 2024-01-31T22:26:21.4794264Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_bool_nobreak_cuda PASSED [0.4555s] [ 96%] 2024-01-31T22:26:21.4795480Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_materialize_cuda PASSED [0.5976s] [ 96%] 2024-01-31T22:26:21.4796667Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_nobreak_cuda PASSED [0.4980s] [ 96%] 2024-01-31T22:26:21.4797825Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op0_cuda PASSED [0.5780s] [ 97%] 2024-01-31T22:26:21.4798998Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op2_cuda PASSED [0.5902s] [ 97%] 2024-01-31T22:26:21.4800153Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op3_cuda PASSED [0.5781s] [ 97%] 2024-01-31T22:26:21.4801300Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op5_cuda PASSED [0.5852s] [ 98%] 2024-01-31T22:26:21.4802450Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op6_cuda PASSED [0.5883s] [ 98%] 2024-01-31T22:26:21.4803692Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_nonzero_size_factory_nobreak_cuda PASSED [0.5201s] [ 98%] 2024-01-31T22:26:21.4805142Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_slice_index_changing_sign_cuda PASSED [1.2644s] [ 99%] 2024-01-31T22:26:21.4806675Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda ('RERUN', {'yellow': True}) [1.0340s] [ 99%] 2024-01-31T22:26:21.4808205Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda ('RERUN', {'yellow': True}) [0.9323s] [ 99%] 2024-01-31T22:26:21.4809548Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda FAILED [0.9301s] [ 99%] 2024-01-31T22:26:21.4810243Z 2024-01-31T22:26:21.4810446Z ==================================== RERUNS ==================================== 2024-01-31T22:26:21.4811113Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4811727Z Traceback (most recent call last): 2024-01-31T22:26:21.4812679Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4813528Z method(*args, **kwargs) 2024-01-31T22:26:21.4814416Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4815255Z method(*args, **kwargs) 2024-01-31T22:26:21.4816139Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4816968Z with policy(): 2024-01-31T22:26:21.4817825Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4818678Z raise RuntimeError(msg) 2024-01-31T22:26:21.4820198Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 4608 and is now reported as 5120 on device 0. CUDA driver allocated memory was 405471232 and is now 407568384. 2024-01-31T22:26:21.4821549Z 2024-01-31T22:26:21.4821814Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4823054Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4823919Z 2024-01-31T22:26:21.4824249Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4825036Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4825647Z Traceback (most recent call last): 2024-01-31T22:26:21.4826586Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4827429Z method(*args, **kwargs) 2024-01-31T22:26:21.4828376Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4829221Z method(*args, **kwargs) 2024-01-31T22:26:21.4830114Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4830941Z with policy(): 2024-01-31T22:26:21.4831801Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4832652Z raise RuntimeError(msg) 2024-01-31T22:26:21.4834089Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 5120 and is now reported as 5632 on device 0. CUDA driver allocated memory was 407568384 and is now 409665536. 2024-01-31T22:26:21.4835449Z 2024-01-31T22:26:21.4835707Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4836947Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4837810Z 2024-01-31T22:26:21.4838211Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4838894Z =================================== FAILURES =================================== 2024-01-31T22:26:21.4839564Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4840175Z Traceback (most recent call last): 2024-01-31T22:26:21.4841105Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4841952Z method(*args, **kwargs) 2024-01-31T22:26:21.4842843Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4843687Z method(*args, **kwargs) 2024-01-31T22:26:21.4844567Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4845560Z with policy(): 2024-01-31T22:26:21.4846434Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4847282Z raise RuntimeError(msg) 2024-01-31T22:26:21.4848726Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 5632 and is now reported as 6144 on device 0. CUDA driver allocated memory was 409665536 and is now 411762688. 2024-01-31T22:26:21.4850072Z 2024-01-31T22:26:21.4850332Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4851568Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4852506Z 2024-01-31T22:26:21.4852826Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4854338Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-13372485e47c0487.xml - 2024-01-31T22:26:21.4855619Z =========================== short test summary info ============================ 2024-01-31T22:26:21.4857973Z FAILED [0.9301s] inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda - RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 5632 and is now reported as 6144 on device 0. CUDA driver allocated memory was 409665536 and is now 411762688. 2024-01-31T22:26:21.4859866Z 2024-01-31T22:26:21.4860131Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4861349Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4862211Z 2024-01-31T22:26:21.4862536Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4863236Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2024-01-31T22:26:21.4863925Z == 1 failed, 307 passed, 14 skipped, 1 xfailed, 2 rerun in 597.87s (0:09:57) === 2024-01-31T22:26:21.4864500Z Got exit code 1 2024-01-31T22:26:21.4864789Z Retrying... 2024-01-31T22:26:21.4865816Z Test results will be stored in test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-3c043064e50e2c06.xml 2024-01-31T22:26:21.4866934Z ============================= test session starts ============================== 2024-01-31T22:26:21.4867808Z platform linux -- Python 3.10.13, pytest-7.3.2, pluggy-1.4.0 -- /opt/conda/envs/py_3.10/bin/python 2024-01-31T22:26:21.4868503Z cachedir: .pytest_cache 2024-01-31T22:26:21.4869361Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2024-01-31T22:26:21.4870264Z rootdir: /var/lib/jenkins/workspace 2024-01-31T22:26:21.4870665Z configfile: pytest.ini 2024-01-31T22:26:21.4871419Z plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0 2024-01-31T22:26:21.4872287Z collecting ... collected 988 items / 322 deselected / 666 selected 2024-01-31T22:26:21.4872872Z stepcurrent: skipping 322 already run items. 2024-01-31T22:26:21.4873322Z Running 3 items in this shard 2024-01-31T22:26:21.4873573Z 2024-01-31T22:26:21.4874321Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda ('RERUN', {'yellow': True}) [1.7662s] [ 33%] 2024-01-31T22:26:21.4875860Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda ('RERUN', {'yellow': True}) [0.2799s] [ 33%] 2024-01-31T22:26:21.4877202Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda FAILED [0.2850s] [ 33%] 2024-01-31T22:26:21.4877909Z 2024-01-31T22:26:21.4878099Z ==================================== RERUNS ==================================== 2024-01-31T22:26:21.4878761Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4879375Z Traceback (most recent call last): 2024-01-31T22:26:21.4880322Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4881159Z method(*args, **kwargs) 2024-01-31T22:26:21.4882047Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4883027Z method(*args, **kwargs) 2024-01-31T22:26:21.4883907Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4884744Z with policy(): 2024-01-31T22:26:21.4885821Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4886676Z raise RuntimeError(msg) 2024-01-31T22:26:21.4888110Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 0 and is now reported as 512 on device 0. CUDA driver allocated memory was 300613632 and is now 340459520. 2024-01-31T22:26:21.4889441Z 2024-01-31T22:26:21.4889702Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4890943Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4891811Z 2024-01-31T22:26:21.4892133Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4892929Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4893545Z Traceback (most recent call last): 2024-01-31T22:26:21.4894492Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4895329Z method(*args, **kwargs) 2024-01-31T22:26:21.4896225Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4897070Z method(*args, **kwargs) 2024-01-31T22:26:21.4897960Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4898791Z with policy(): 2024-01-31T22:26:21.4899664Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4900509Z raise RuntimeError(msg) 2024-01-31T22:26:21.4902034Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 512 and is now reported as 1024 on device 0. CUDA driver allocated memory was 340459520 and is now 342556672. 2024-01-31T22:26:21.4903394Z 2024-01-31T22:26:21.4903655Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4904901Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4905765Z 2024-01-31T22:26:21.4906096Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4906762Z =================================== FAILURES =================================== 2024-01-31T22:26:21.4907440Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4908054Z Traceback (most recent call last): 2024-01-31T22:26:21.4908993Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4909846Z method(*args, **kwargs) 2024-01-31T22:26:21.4910744Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4911586Z method(*args, **kwargs) 2024-01-31T22:26:21.4912478Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4913315Z with policy(): 2024-01-31T22:26:21.4914174Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4915112Z raise RuntimeError(msg) 2024-01-31T22:26:21.4916562Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 1024 and is now reported as 1536 on device 0. CUDA driver allocated memory was 342556672 and is now 344653824. 2024-01-31T22:26:21.4917915Z 2024-01-31T22:26:21.4918220Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4919477Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4920348Z 2024-01-31T22:26:21.4920667Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4922178Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-3c043064e50e2c06.xml - 2024-01-31T22:26:21.4923479Z =========================== short test summary info ============================ 2024-01-31T22:26:21.4926039Z FAILED [0.2850s] inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda - RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 1024 and is now reported as 1536 on device 0. CUDA driver allocated memory was 342556672 and is now 344653824. 2024-01-31T22:26:21.4927949Z 2024-01-31T22:26:21.4928250Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4929495Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4930367Z 2024-01-31T22:26:21.4930688Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4931387Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2024-01-31T22:26:21.4932031Z ================== 1 failed, 322 deselected, 2 rerun in 2.41s ================== 2024-01-31T22:26:21.4932551Z Got exit code 1 2024-01-31T22:26:21.4932837Z Retrying... 2024-01-31T22:26:21.4933933Z Test results will be stored in test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-d393e3907111f09e.xml 2024-01-31T22:26:21.4935059Z ============================= test session starts ============================== 2024-01-31T22:26:21.4935940Z platform linux -- Python 3.10.13, pytest-7.3.2, pluggy-1.4.0 -- /opt/conda/envs/py_3.10/bin/python 2024-01-31T22:26:21.4936630Z cachedir: .pytest_cache 2024-01-31T22:26:21.4937496Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2024-01-31T22:26:21.4938343Z rootdir: /var/lib/jenkins/workspace 2024-01-31T22:26:21.4938738Z configfile: pytest.ini 2024-01-31T22:26:21.4939509Z plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0 2024-01-31T22:26:21.4940386Z collecting ... collected 988 items / 322 deselected / 666 selected 2024-01-31T22:26:21.4940976Z stepcurrent: skipping 322 already run items. 2024-01-31T22:26:21.4941421Z Running 3 items in this shard 2024-01-31T22:26:21.4941672Z 2024-01-31T22:26:21.4942434Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda ('RERUN', {'yellow': True}) [1.7939s] [ 33%] 2024-01-31T22:26:21.4943974Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda ('RERUN', {'yellow': True}) [0.2777s] [ 33%] 2024-01-31T22:26:21.4945314Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda FAILED [0.2783s] [ 33%] 2024-01-31T22:26:21.4946023Z 2024-01-31T22:26:21.4946213Z ==================================== RERUNS ==================================== 2024-01-31T22:26:21.4946950Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4947562Z Traceback (most recent call last): 2024-01-31T22:26:21.4948557Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4949404Z method(*args, **kwargs) 2024-01-31T22:26:21.4950307Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4951144Z method(*args, **kwargs) 2024-01-31T22:26:21.4952031Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4952870Z with policy(): 2024-01-31T22:26:21.4953721Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4954576Z raise RuntimeError(msg) 2024-01-31T22:26:21.4956014Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 0 and is now reported as 512 on device 0. CUDA driver allocated memory was 300613632 and is now 340459520. 2024-01-31T22:26:21.4957336Z 2024-01-31T22:26:21.4957609Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4958899Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4959771Z 2024-01-31T22:26:21.4960095Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4960888Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4961509Z Traceback (most recent call last): 2024-01-31T22:26:21.4962443Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4963294Z method(*args, **kwargs) 2024-01-31T22:26:21.4964191Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4965179Z method(*args, **kwargs) 2024-01-31T22:26:21.4966141Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4966983Z with policy(): 2024-01-31T22:26:21.4967841Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4968690Z raise RuntimeError(msg) 2024-01-31T22:26:21.4970133Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 512 and is now reported as 1024 on device 0. CUDA driver allocated memory was 340459520 and is now 342556672. 2024-01-31T22:26:21.4971480Z 2024-01-31T22:26:21.4971745Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4972978Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4973846Z 2024-01-31T22:26:21.4974179Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4974966Z =================================== FAILURES =================================== 2024-01-31T22:26:21.4975675Z ___________ TestInductorDynamicCUDA.test_unbacked_index_select_cuda ____________ 2024-01-31T22:26:21.4976289Z Traceback (most recent call last): 2024-01-31T22:26:21.4977235Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4978072Z method(*args, **kwargs) 2024-01-31T22:26:21.4978955Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2698, in wrapper 2024-01-31T22:26:21.4979881Z method(*args, **kwargs) 2024-01-31T22:26:21.4980757Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2697, in wrapper 2024-01-31T22:26:21.4981589Z with policy(): 2024-01-31T22:26:21.4982450Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2146, in __exit__ 2024-01-31T22:26:21.4983303Z raise RuntimeError(msg) 2024-01-31T22:26:21.4984737Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 1024 and is now reported as 1536 on device 0. CUDA driver allocated memory was 342556672 and is now 344653824. 2024-01-31T22:26:21.4986083Z 2024-01-31T22:26:21.4986344Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4987579Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4988454Z 2024-01-31T22:26:21.4988772Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4990271Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-d393e3907111f09e.xml - 2024-01-31T22:26:21.4991556Z =========================== short test summary info ============================ 2024-01-31T22:26:21.4993916Z FAILED [0.2783s] inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda - RuntimeError: CUDA driver API confirmed a leak in __main__.TestInductorDynamicCUDA.test_unbacked_index_select_cuda! Caching allocator allocated memory was 1024 and is now reported as 1536 on device 0. CUDA driver allocated memory was 342556672 and is now 344653824. 2024-01-31T22:26:21.4995808Z 2024-01-31T22:26:21.4996073Z To execute this test, run the following from the base repo dir: 2024-01-31T22:26:21.4997294Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unbacked_index_select_cuda 2024-01-31T22:26:21.4998180Z 2024-01-31T22:26:21.4998574Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2024-01-31T22:26:21.4999271Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2024-01-31T22:26:21.4999910Z ================== 1 failed, 322 deselected, 2 rerun in 2.43s ================== 2024-01-31T22:26:21.5000422Z Got exit code 1 2024-01-31T22:26:21.5000708Z Retrying... 2024-01-31T22:26:21.5001739Z Test results will be stored in test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-ddb948e49ef1ed3f.xml 2024-01-31T22:26:21.5002856Z ============================= test session starts ============================== 2024-01-31T22:26:21.5003726Z platform linux -- Python 3.10.13, pytest-7.3.2, pluggy-1.4.0 -- /opt/conda/envs/py_3.10/bin/python 2024-01-31T22:26:21.5004423Z cachedir: .pytest_cache 2024-01-31T22:26:21.5005442Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2024-01-31T22:26:21.5006297Z rootdir: /var/lib/jenkins/workspace 2024-01-31T22:26:21.5006699Z configfile: pytest.ini 2024-01-31T22:26:21.5007458Z plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0 2024-01-31T22:26:21.5008378Z collecting ... collected 988 items / 323 deselected / 665 selected 2024-01-31T22:26:21.5008967Z stepcurrent: skipping 323 already run items. 2024-01-31T22:26:21.5009417Z Running 2 items in this shard 2024-01-31T22:26:21.5009665Z 2024-01-31T22:26:21.5010220Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_reduction_cuda PASSED [1.7784s] [ 50%] 2024-01-31T22:26:21.5011780Z inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unwrap_storage_didnt_work_repro_cuda XFAIL [0.0617s] [100%] 2024-01-31T22:26:21.5012636Z 2024-01-31T22:26:21.5013876Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-ddb948e49ef1ed3f.xml - 2024-01-31T22:26:21.5015392Z ================= 1 passed, 323 deselected, 1 xfailed in 1.91s ================= 2024-01-31T22:26:21.5016711Z The following tests failed consistently: ['test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda'] 2024-01-31T22:26:21.5017555Z 2024-01-31T22:26:21.5018411Z FINISHED PRINTING LOG FILE of inductor/test_torchinductor_dynamic_shapes 1/3 (test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.3_b89fba1a042240da_.log) 2024-01-31T22:26:21.5019309Z 2024-01-31T22:26:21.5019522Z inductor/test_torchinductor_dynamic_shapes 1/3 failed! 2024-01-31T22:26:21.5020343Z Starting test batch 'probable_relevance' 631.7884356975555 seconds after initiating testing 2024-01-31T22:26:21.5021038Z With sharding, this batch will run 46 tests 2024-01-31T22:26:22.9066980Z Running dynamo/test_dynamic_shapes 1/1 ... [2024-01-31 22:26:22.906178] 2024-01-31T22:26:22.9068815Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_dynamic_shapes.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 22:26:22.906446] 2024-01-31T22:43:27.1233193Z 2024-01-31T22:43:27.1234493Z dynamo/test_dynamic_shapes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_dynamic_shapes_1.1_4022e3a8f05c11da_.log 2024-01-31T22:43:27.1914118Z Running 1324 items in this shard: test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_arguments_binding_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_cpu_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_cpu_graph_break_2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_cpu_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_cpu_graph_break_inner_fn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_decorator_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_device_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_float64_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_graph_break_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autocast_sdpa_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autograd_profiler_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_autograd_profiler_enabled_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_context_wrapping_grad_mode_decorator_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_context_wrapping_grad_mode_nested_function_decorator_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_context_wrapping_set_grad_enabled_nested_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_cuda_amp_autocast_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_cuda_event_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_cuda_stream_across_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_cuda_stream_context_manager1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_cuda_stream_context_manager2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_cuda_stream_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_disable_saved_tensors_hooks_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_disable_saved_tensors_hooks_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_disable_saved_tensors_hooks_prev_disabled_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_disable_saved_tensors_hooks_prev_disabled_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_generic_context_manager_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_generic_context_manager_with_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_grad_mode_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_graph_break_inlining_autocast_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_graph_break_inlining_grad_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_is_autocast_cpu_enabled_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_nested_generic_context_manager_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_nested_generic_context_manager_with_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_nested_grad_mode_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_no_grad_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesCtxManagerTests::test_torch_profiler_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_T_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_add__dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_add_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_addcdiv__dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_addcdiv_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_build_list_unpack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_call_dict1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_call_dict2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_call_dict3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_call_dict4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_call_dict5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_callable_builtin_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_callable_lambda_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_callable_torch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_chunks1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_compare_constant_and_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_complex_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_const_tuple_add1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_const_tuple_add2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_constant1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_constant2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_constant3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_constant4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_context_wrapping_nested_functions_no_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_cublas_allow_tf32_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_custom_dict_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_default_dict_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_default_dict_constr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_default_dict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_default_dict_lambda_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_del_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_deque_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_device_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_device_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_copy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_fromkeys_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_keys_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_ops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_param_keys_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_sorted_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_tuple_lazy_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_update_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dict_values_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_distributed_is_available_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_distributed_is_initialized_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dtype_compare_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_finfo_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_float_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_fn_with_self_set_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_fstrings1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_fstrings2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_fstrings3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_funcdef_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_functools_partial_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_get_autocast_gpu_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_get_calculate_correct_fan_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_get_default_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_get_privateuse1_name_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_globalfn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_globalmodule_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_globalvar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_import1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_indirect1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_indirect2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_indirect3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_inline_jit__unwrap_optional_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_inline_jit_annotations_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_inline_softmax_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_inline_with_default_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_inner_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_complex_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_contiguous_frame_counts_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_contiguous_memory_format_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_floating_point_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_fx_tracing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_in_onnx_export_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_not_null_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_quantized_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_is_sparse_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_islice_chain_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_itertools_chain_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_itertools_chain_from_iterable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_itertools_combinations_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_itertools_product_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_jit_annotate_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_len_constant_dict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_len_constant_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_len_constant_misc_iterables_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_len_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_add_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_clear_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_convert_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_index_with_constant_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_reversed_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_slice_assignment_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_sorted1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_sorted2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_list_truth_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_listarg1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_listarg2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_listarg3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_listarg4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_listarg5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_load_global_bool_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_mT_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_manual_seed_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_map_sum_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_math_radians_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_mean_sum_np_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_methodcall1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_methodcall2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_methodcall3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_min_max_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_module_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_namedtuple_defaults_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_namedtuple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_namedtuple_user_methods_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ndarray_builtin_functions_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ndarray_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ndarray_methods_returning_scalar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ndarray_reshape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ndarray_transpose_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ndim_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_no_recompile_inner_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_no_recompile_inner_lambda_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_non_inlined_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_not_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_attributes_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_dtype_argument_to_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_fft_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_linalg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_meshgrid_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_random_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_numpy_size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_ordered_dict_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partial_across_graph_break_uninvoked_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_as_input_UDF_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_as_input_partials_lambda_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_as_input_partials_mod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___annotations___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___builtins___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___call___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___class___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___closure___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___code___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___defaults___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___delattr___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___dict___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___dir___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___doc___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___eq___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___format___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___ge___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___get___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___getattribute___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___globals___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___gt___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___hash___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___init___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___init_subclass___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___kwdefaults___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___le___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___lt___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___module___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___name___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___ne___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___new___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___qualname___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___reduce___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___reduce_ex___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___repr___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___setattr___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___sizeof___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___str___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr___subclasshook___dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr_func_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_attr_keywords_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_hasattr_set_attr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_lambda_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_recompilation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_torch_op_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_torch_op_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_udf_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_udf_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_udf_kwarg_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_partials_udf_kwarg_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_pop_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_pos_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_pow_int_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_promote_types_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_range1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_range2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_reduce_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_reduce_with_initial_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_reduce_with_none_initial_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_reduce_with_single_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_reduce_with_single_with_initial_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_return_dict2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_return_dict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_return_multiple_numpy_ndarray_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_return_numpy_ndarray_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_return_tuple1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_return_tuple2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_set_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_shape1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_shape2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_slice1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_slice2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_slice3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_slice4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_slice5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_slice6_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_startswith_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_sum_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_sum_shortcut_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_sum_shortcut_with_start_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_sum_shortcut_with_start_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_sum_with_start_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_sum_with_start_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_symbool_to_int_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_element_size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_len_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_new_with_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_new_with_size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_size_indexed_by_symint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_type2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_type3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_type4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_type5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tensor_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_torch_distributions_functions_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_torch_from_numpy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_transpose_for_scores_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tuple1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tuple2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tuple_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tuple_iadd_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_tuple_sorted_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unary_fold_op_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unpack1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unpack2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unpack3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unpack_ex1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unpack_ex2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_unpack_ex3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_viamethod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFunctionTests::test_viatorch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_T_tensor_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_add_sizes_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_add_to_set_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_anomaly_aot_autograd_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_any_all_symnode_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_assigning_function_to_class_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_assigning_function_to_object_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_auto_functionalize_can_with_default_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_auto_functionalize_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_auto_functionalize_on_view_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_auto_functionalize_optional_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_auto_functionalize_with_returns_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_backend_match_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_backend_match_guard_multi_threads_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_backward_deterministic_mode_mismatch_warning_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_boolarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_build_tuple_unpack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_builder_for_class_with_metaclass_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_builtin_abs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_builtin_isinstance_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_builtin_str_on_user_defined_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_builtin_subclasses_as_method_on_class_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_builtin_subclasses_as_method_on_var_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_call_parent_non_class_methods_from_child_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_callpacked_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_can_auto_functionalize_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cannot_trace_mark_dynamic_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cannot_trace_mark_dynamic_safe_unreached_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cast_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cell_output1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cell_output2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_class_duner_mro_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_class_has_instancecheck_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_closure_out_of_scope_cell_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_closure_out_of_scope_cell_with_cond_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_closure_out_of_scope_cell_with_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_closure_with_mutation_and_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compare_shapes_eq_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compare_shapes_neq_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compare_shapes_tuple_eq_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compare_shapes_tuple_neq_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compare_shapes_with_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compilation_metrics_size_limit_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compile_profiler_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_compute_exception_table_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cond_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cond_export_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cond_export_single_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cond_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cond_side_effects_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cond_with_quantization_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_config_getattr_default_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_config_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_const_dict_variable_python_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_constant_getattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cross_entropy_loss_fancy_ctor1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cross_entropy_loss_fancy_ctor2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cross_entropy_loss_simple_ctor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_cuda_set_device_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dataclass_fields_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dataclass_local_hasattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_default_args_device_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_default_dtype_change_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_deque_append_left_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_deque_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_derpy_nn_module_usage_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_deterministic_algorithms_mutated_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dict_mutation_side_effect_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dict_order_keys_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dict_order_keys_modules_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dict_order_keys_tensors_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dict_reconstruct_keeps_original_order_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dict_subclass_cannot_be_initialized_in_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dictcomp_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_disable_flag_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dtypes_no_graphbreaks_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dunder_methods_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dunder_new_function_inlining_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_duplicate_graph_break_log_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dynamic_one_hot_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dynamo_compiling_fake_tensor_to_vararg_int_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_dynamo_min_operator_with_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_empty_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_enum_as_dict_key_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_enum_as_dict_key_with_overloaded_str_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_enum_guards_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_enum_no_graphbreaks_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_error_on_nested_fx_trace_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_error_on_recompile_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_exception_table_e2e_2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_exception_table_e2e_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_exception_table_encode_varint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_exception_table_entry_propagation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_exception_table_parsing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_flat_name_to_original_fqn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_fn_hasattr__name__1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_fn_hasattr__name__2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_fn_hasattr__name__3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_fold_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_frozenset_torch_func_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_funcname_cache_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_function_annotation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_generate_tensor_from_list_of_numpy_primitive_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_generate_trivial_abstract_impl_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_get_attr_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_get_cache_entry_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_get_custom_tensor_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_get_device_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_get_instruction_source_311_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_getset_descriptor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_grad_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_grad_state_mutated_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_graph_break_correctly_when_passing_numpy_ndarray_to_torch_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guard_failure_fn2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guard_failure_fn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guard_failure_fn_shape_control_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guard_failure_fn_tensor_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guard_function_builder_with_cse_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guards_cse_pass_multiple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guards_cse_pass_single_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_guards_strip_function_call_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_hash_getitem_slice_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_id_of_nn_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_cond_nn_mod1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_cond_nn_mod2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_cond_nn_mod3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_cond_user_defined_object2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_cond_user_defined_object3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_cond_user_defined_object_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_if_tensor_is_none_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inference_mode_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inline_closure_not_loaded_by_parent_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inline_dict_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inline_dict_function_passed_as_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inline_dict_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inline_func_jump_on_tensor_condition_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inline_list_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inplace_desugaring_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inplace_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inplace_param_update_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_inplace_view_on_graph_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_input_set_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_int_int_comparisons_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_int_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_int_neg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_int_shape_binops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_int_shape_comparisons_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_int_shape_inplace_binops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_intermediary_tensor_grad_access_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_compiling_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_floating_point2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_floating_point_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_tensor2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_tensor_like2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_is_tensor_like_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_item_changes_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_item_changes_new_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_item_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_iter_set_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_accumulate_symint_default_sum_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_accumulate_tensors_builtins_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_accumulate_tensors_default_sum_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_accumulate_tensors_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_accumulate_tensors_user_defined_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_groupby_pure_python_default_identify_func_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_groupby_pure_python_key_func_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_infinite_count_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_infinite_cycle_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_infinite_repeat_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_infinite_repeat_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_itertools_repeat_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_large_reduction_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_linetable_310_writer_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_linetable_311_writer1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_linetable_311_writer2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_append_return_none_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_hasattr1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_hasattr2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_iadd_side_effect_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_iadd_with_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_iterator_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_mul_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_list_slice_mul_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_listcomp_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_lnotab_writer_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_mandelbrot_numpy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_map_side_effects_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_map_with_quantization_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_mark_static_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_matmul1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_min_max_over_iterable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_module_complex_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_module_deepcopy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_module_not_callable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_named_parameters_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_namedtuple1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_namedtuple2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_namedtuple3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nan_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_closure_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_function_resuming_with_correct_globals_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_optimize_decorator_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_optimize_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_optimize_run_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nested_wraps_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nn_functional_reduction_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nn_module_getattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nn_module_getattribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nn_sequential_invocation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nn_sequential_invocation_reposition_indices_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_no_error_on_nested_fx_trace_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_no_raise_guard_partial_constraint_across_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_no_raise_guard_partial_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_non_pt2_compliant_ops_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_nonzero_static_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_not_dynamic_scope_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numel_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_array_of_arrays_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_as_global_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_fallback_on_eager_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_force_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_gt_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_int_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_min_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_ndarray_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_ndarray_graph_break_with_multiple_outputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_ndarray_works_with_builtin_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_no_raise_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_non_torch_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_random_config_to_numpy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_readonly_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_recompilation_scalar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_size_attr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_subdtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_take_along_axis_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_tolist_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_torch_operators_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_unique_f16_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_variable_isinstance_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_numpy_with_builtin_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_object_classmethod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_object_staticmethod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_onnx_shape_as_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_optimize_on_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_ordered_dict_alias_reconstruct_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_out_variants_with_resizing_on_graph_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_pair_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_param_shape_binops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_parsing_sdpa_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_patched_builtin_functions_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_pt2_compliant_ops_are_allowed_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_pt2_compliant_overload_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_pure_python_accumulate_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_py311_jump_offset_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_py_guards_mark_dynamic_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_python_slice_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_raise_guard_full_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_raise_guard_partial_constraint_across_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_raise_guard_partial_constraint_no_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_raise_on_backend_error_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_raises_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_rand_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_range_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_range_with_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_real_imag_tensor_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_recompile_on_global_state_change_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_reconstruct_set_across_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_recursive_inline_list_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_recursive_tensor_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_release_input_memory_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_release_module_memory_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_release_scope_memory_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_remove_dead_code_with_exn_table_entries_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_repeat_interleave_graphbreaks_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_repro_graph_breaks_in__get_item_by_idx_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_restore_graphstate_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_return_dict_with_graph_break_and_update_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_return_nested_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_sample_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_scalar_tensor_is_equivalent_to_int_list_argument_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_scalar_tensor_is_equivalent_to_symint_argument_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_scalar_tensor_is_equivalent_to_symint_list_argument_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_set_aliasing_recompiles_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_set_custom_tensor_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_setattr_mutation1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_setattr_mutation2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_setattr_mutation3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_and_tuple_equality_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_constructor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_create_symbolic_sizes_strides_storage_offset_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_empty_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_evaluate_expr_divisible_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_evaluate_expr_refinement_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_evaluate_expr_replacement_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_runtime_assert_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_equal_unbacked_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_no_recording_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_env_recorded_function_fallback_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_int_comparisons_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_int_inplace_binops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_shape_unpack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_side_effects_codegen_update_mutated_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_simple_set_usage_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_size_dim_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_size_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_slice_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_source_non_input_grad_access_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_str_format_assert1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_str_format_assert2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_str_format_return1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_str_format_return2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_stride_dim_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_super_calling_with_metaclass_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_sys_modules_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tagging_tensors_mix_used_unused_structure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tagging_tensors_simple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_build_list_unpack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_data_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_dict1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_dict2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_dict3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_dot_grad_no_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_interacts_with_numpy_ndarray_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_is_contiguous_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_item_capture_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_item_no_capture_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_layout_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tensor_types_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tolist_0d_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tolist_1d_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tolist_float_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tolist_kd_dynamic_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tolist_kd_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tolist_scalar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_top_package_import_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_check_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_check_is_size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_check_symbolic_shape_rel_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_compile_ctx_on_forward_and_training_step_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_cuda_is_available_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_cudnn_is_acceptable_bad_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_cudnn_is_acceptable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_device_python_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_distributions_lazy_property_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_dtype_python_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_dynamo_codegen_pow_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_generator_set_state_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_guards_stack_frame_register_inlining_deep_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_guards_stack_frame_register_inlining_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_nn_parameter_isinstance_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_objects_as_keys_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_seed_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_size_numel_dynamic_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_size_numel_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_variable_hasattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_trace_ndarray_frame_2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_trace_ndarray_frame_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_nested_py_tree_dicts_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_nested_py_tree_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_nested_py_tree_mixed_all_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_nested_py_tree_tuples_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_py_tree_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_py_tree_tensor_subclass_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tracing_tree_map_only_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tuple_from_tuple_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tuple_hasattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tuple_iadd_with_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tuple_mul_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_tuple_mul_with_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_type_copy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_typing_typevar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_typing_union_and_optional_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_typing_variable_isinstance_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_unbacked_symint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_unhandled_exception_in_dynamo2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_unhandled_exception_in_dynamo_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_unpack4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_unpack5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_update_locals_and_stack_uses_shared_cache_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_defined_binop_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_defined_class_name_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_defined_class_python_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_function_variable_supports_enum_argument_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_function_variable_supports_function_argument_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_function_variable_supports_type_abcmeta_argument_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_getattr1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_getattr2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_getattribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_user_property_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_usr_cls_classmethod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_usr_cls_staticmethod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_variable_access_in_exception_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_variable_tracker_recursively_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_version_ci_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_with_builtin_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_write_to_closures_in_inlining_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_yield_from_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_yield_gen_and_from_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_yield_send_to_subgenerator_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_Size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_abc_setattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_add_sub_alpha_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_addr_alpha_beta_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_attached_attribute_in_dir_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_autograd_function_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_avoid_dupe_specialization_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_batch_encoding_clone_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_batch_norm_act_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_batchnorm_e2e_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_bigbird_unsqueeze_inplace_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_boxes_len_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_build_map_unpack_with_call_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_call_finally_opcode_python_3_8_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_call_finally_python_3_8_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_chunk_reformer_ff_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_class_member_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_convert_boxes_to_pooler_format_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_copy_weird_strides_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_create_rand_mask_from_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dalle2_maybe_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dedup_global_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_deferred_runtime_asserts_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_delattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_delattr_raises_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dict_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dict_list_values_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dict_subclass_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_do_paste_mask_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dropout_inline_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dynamic_shapes_double_not_equal_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dynamic_shapes_float_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dynamic_shapes_implicit_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dynamic_shapes_right_side_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_ellipsis_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_embedding_backward_broadcasting_decomp_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_empty_list_contains_with_jump_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_ephemeral_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_error_return_without_exception_set_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_exception_in_dynamo_handling_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_exec_import_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_exec_wildcard_import_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_for_loop_graph_break_before_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_for_loop_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_function_in_skipfiles_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_functools_wraps_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_gan_repro_trying_to_backward_through_the_graph_a_second_time_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_generator_dealloc_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_get_parameter_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_grad_mode_carrying_correct_state_after_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_grad_references_cleared_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_on_jit_isinstance_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_unsupported_fake_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_guard_default_device_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_guard_fail_nested_tuple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_guard_fail_tensor_bool_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_guard_ordering_shape_fail_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_bigbird_unsqueeze_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_classinstantier_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_gelu_inline_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_model_output_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_t5_forward_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_xsoftmax_inference_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_hf_xsoftmax_training_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_iadd_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_indexing_with_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_inductor_no_recursionerror_on_for_loops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_inference_mode_dynamic_shapes_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_inplace_unsqueeze_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_intermediate_leaf_requires_grad_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_is_symbolic_tracing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_isinstance_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_isinstance_storage_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_issue114171_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_issue1466_size_aot_autograd_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_issue175_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_jit_trace_errors_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_kwargs_out_list_variable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_list_aliasing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_list_index_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_list_index_not_found_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_list_index_tensor_unsupported_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_list_self_reference_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_longformer_chunk_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_longtensor_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_maml_item_capture_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_maml_no_item_capture_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_many_views_with_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_maybe_multiply_symint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_merge_criteria_processor_list1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_merge_criteria_processor_list2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_module_in_skipfiles_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_modules_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_multi_dot_import_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_multi_import_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_named_buffers_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_negative_shape_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_nested_while_loop_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_nn_parameter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_no_grad_inline_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_norm_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_not_rewrite_assert_for_other_errors_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_nullcontext1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_nullcontext2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_numpy_not_ndarray_recompiles_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_numpy_tobytes_no_error_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_odict_get_item_index_name_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_optim_state_references_cleared_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_optimized_deepcopy_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_out_none_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_out_overload_non_contiguous_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_output_aliases_intermediate_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_pointless_graph_removal_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_primtorch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_primtorch_no_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_recursive_map_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_reformer_eval_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_reformer_min_chunk_len_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_reformer_sorting_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_reformer_train_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_reinplacing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_relative_import_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_relative_import_no_modulename_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_requires_grad_guards_with_grad_mode1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_requires_grad_guards_with_grad_mode2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_restricted_list_subclass1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_restricted_list_subclass2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_restricted_list_subclass3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_rewrite_assert_dont_change_bytecode_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_rewrite_assert_noop_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_rewrite_assert_with_msg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_rewrite_assert_with_non_string_msg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_rewrite_assert_without_msg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_rng_state_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_seq_append_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_setattr_requires_grad_graph_breaks_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_setitem_boolean_mask_diff_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_setitem_tuple_boolean_mask_diff_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_sigmoid_out2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_sigmoid_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_size_typematch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_slice_into_list_mutable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_slicing_dynamic_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_slicing_dynamic_shape_setitem_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_sort_out2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_sort_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_specialized_stride_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_split_with_sizes_aot_autograd_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_string_format_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_sub_alpha_scalar_repro_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_swin_base_tensor_attr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tensor_data_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tensor_isinstance_tuple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tensor_item_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tensor_set_data_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tensor_split_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_threading_local_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tokenization_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_torch_ops_aten_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_torch_tensor_ops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_torch_tensor_ops_no_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_torch_variable_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_tuple_enum_as_key_dict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_unspecialized_nn_module_with_torch_variable_attribute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_user_ctor_ctx_manager_custom_init_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_user_ctor_ctx_manager_custom_init_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_user_ctor_ctx_manager_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_user_defined_object_callable_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_validate_model_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_vdd_duplicate_error_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_view_dtype_overload_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_while_loop_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_while_loop_graph_break_inside_call_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_with_on_graph_break_inst_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_with_on_graph_break_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_access_by_keys_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_basicmodule1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_basicmodule2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_call_fn_with_non_const_inputs_safe_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_cfgmod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_children_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_constloop_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_conv_call_forward_directly_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_conv_call_super_forward_directly_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_conv_transpose_call_forward_directly_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_conv_transpose_call_super_forward_directly_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_densenet_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_enumvalues_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_fnmember_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_fnmembercmp1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_fnmembercmp2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_forward_directly_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_generation_tag_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_hasattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_intarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_iseval1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_iseval2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_isnonelayer_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_istraining1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_istraining2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_layerlist_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module6_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_lazy_module_no_cls_to_become_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_attribute_precedence_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_call_module_with_static_forward_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_class_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_comparison_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_forward_has_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_guard_name_is_valid_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_name_string_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_property_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_module_static_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_moduledict_custom_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_moduledict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_modulelist_custom_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_modulelist_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_modulelist_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_modulemethod1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_modulemethod2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_named_children_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_nn_moduledict_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_parameterdict_custom_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_parameterdict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_parameters1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_parameters2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_parameters3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_self_mutating1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_seq_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_sequential_with_duplicated_module2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_sequential_with_duplicated_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_simple_torch_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_stringmember_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_submodules1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_submodules2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_super1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_super2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_super_class_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_tensorlist_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_torch_function_with_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_unsupportedmethod_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_unsupportedmodule_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesNNModuleTests::test_viamodulecall_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_access_class_method_from_user_class_attr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_access_class_method_from_user_class_builtin_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_byte_tensor_does_not_crash_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_capture_symbolic_tracing_simple_within_fake_mode_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_capture_symbolic_tracing_within_fake_mode_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_free_variables_overlapping_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_op_param_buffer_lifted_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_branch_args_mismatch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_branch_return_multiple_tensors_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_branch_return_non_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_mismatch_return_length_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_mismatch_return_tensor_meta_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_missing_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_non_list_operands_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_non_tensor_operands_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_raise_user_error_on_unsupported_pred_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_cond_supported_pred_types_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_constraint_violation_error_messages_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dataclass_input_output_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dict_return_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dict_return_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_2_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_reorder_with_non_tensor_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_reorder_with_non_tensor_arg_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_with_non_tensor_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_with_non_tensor_arg_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_with_non_tensor_output_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_and_bypass_with_non_tensor_output_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dupes_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dynamic_slicing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dynamic_slicing_invalid_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_dynamic_slicing_simple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_empty_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_enforce_equalities_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_compare_optimize_with_make_fx_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_cond_in_aten_symbolic_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_control_flow_with_getattr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_decomp_asserts_bad_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_decomp_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_defaults_ok_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_dynamic_control_flow_error_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_dynamic_dim_cleanup_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_dynamic_dim_not_1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_dynamic_dim_raise_on_compound_range_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_dynamic_dim_range_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_graph_bypass_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_graph_bypass_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_graph_with_complex_reorder_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_graph_with_complex_reorder_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_graph_with_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_graph_with_list_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_identity_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_mark_dynamic_conflict_dynamic_dim_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_meta_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_meta_val_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_mismatched_out_2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_mismatched_out_2_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_mismatched_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_mismatched_out_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_module_specify_constraints_signature_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_multi_dynamic_dim_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_multi_dynamic_dim_unsafe_relationship_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_nn_module_stack_patched_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_no_raise_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_pass_arg_by_name_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_pass_arg_by_name_star_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_persist_assert_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_preserve_constraints_as_metadata_scalar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_preserve_constraints_as_metadata_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_preserves_nn_module_stack_for_get_attr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_raise_guard_full_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_raise_guard_partial_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_raise_on_relationship_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_shape_control_flow_1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_specialized_int_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_symbolic_shape_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_args_and_empty_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_args_with_default_None_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_args_with_default_float_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_args_with_default_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_args_with_default_tuple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_builtin_op_on_assume_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_cond_branches_calling_methods_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_cond_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_cond_dynamic_shape_pred_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_cond_with_closed_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_dict_values_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_free_function_and_class_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_free_function_and_class_method_multiarg_diff_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_free_function_and_class_method_multiarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_free_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_list_nonzero_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_list_nonzero_free_function_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_method_on_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_method_on_module_invoke_twice_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_none_control_flow_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_none_control_flow_free_func_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_not_none_control_flow_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_not_none_control_flow_free_func_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_not_none_control_flow_pos_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_not_return_const_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_constant_tuple_nonzero_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_functools_wrapped_fn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_functools_wrapped_method_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_kwargs_and_empty_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_kwargs_with_default_None_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_kwargs_with_default_float_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_kwargs_with_default_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_kwargs_with_default_tuple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_map_cond_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_map_zero_sized_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_map_zero_sized_tensor_suppress_errors_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_module_layer_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_nonzero_static_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_parameters_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_shallow_list_copy_with_side_effects_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_shallow_list_copy_wo_side_effects_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_stack_trace_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_symbool_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_export_with_wrapped_fn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_exported_graph_serialization_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_func_return_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_func_return_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_fx_pytree_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_input_container_type_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_invalid_input_global_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_invalid_input_global_multiple_access_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_invalid_input_nonlocal_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_invalid_input_unused_nonlocal_ok_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_list_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_list_not_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_list_unpack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_list_unpack_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_map_cond_param_buffer_lifted_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_mixed_real_and_fake_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_multiple_outputs_op_with_evaluator_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_nested_cond_op_param_buffer_lifted_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_not_functionalize_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_param_buffer_safe_from_mutation_recurse_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_param_buffer_safe_from_mutation_simple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_pre_dispatch_simple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_predispatch_with_for_out_dtype_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_predispatch_with_for_out_dtype_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_predispatch_with_higher_order_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_predispatch_with_higher_order_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_preserve_fx_node_metadata_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_preserve_fx_node_metadata_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_preserve_fx_node_metadata_inline_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_preserve_fx_node_metadata_recompile_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_remove_redundant_dynamic_dim_in_error_message_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dict_container_inp_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_nested_list_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_round_dynamic_shapes_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_sym_contains_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_symbolic_tracing_within_fake_mode_with_constraints_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_symbolic_tracing_within_fake_mode_with_constraints_with_parameters_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_symbool_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_torch_inference_mode_ctx_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_trivial_constraint_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_uncaptured_higher_order_op_error_not_suppresed_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_untracked_inputs_in_constraints_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_zeroes_in_and_out_different_shape_on_test_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_zeroes_in_and_out_different_shape_on_test_with_aten_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_zeroes_in_new_shape_scalar_out_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_zeroes_in_new_shape_scalar_out_permute_dupe_and_bypass_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_zeroes_in_new_shape_scalar_out_permute_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_capi_call1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_capi_call2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_capi_call3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_control_flow1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_control_flow2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_control_flow3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_control_flow4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_control_flow5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_dynamic_duck_size_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_dynamic_getitem_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_dynamic_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_dynamic_order_dependence_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_dynamic_zero_inference_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_enumerate_not_break_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_extended_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_graph_break_on_item_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_indirect_unsupported1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_indirect_unsupported2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_indirect_unsupported3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_multigraph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_no_graph_break_on_item_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_pop_after_resume_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_restore_range_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_restore_range_iter_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_restore_state_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume5_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume_freevars_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume_paths_join_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume_tuple_iterator_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume_with_no_grad1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume_with_no_grad2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_resume_with_no_grad3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_stack_state1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_stack_state2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_start1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_start2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_start3_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_start4_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_tuple_iterator_mutate_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesSubGraphTests::test_tuple_iterator_return_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_access_module_attr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_constants_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_global_num_adds_guard_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_global_num_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_input_num_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_numpy_number_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_tracked_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_tracked_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_untracked_global_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_untracked_global_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_untracked_nonlocal_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_capture_value_created_in_subgraph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_branches_no_arguments_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_branches_no_arguments_no_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_free_variable_in_both_branches_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_graph_break_in_one_branch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_pytree_operands_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_pytree_operands_with_non_tensor_leaves_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_side_effect_in_one_branches_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_source_fn_stack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_subgraph_name_is_valid_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_cond_with_constant_pred_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_enum_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_error_message_sane_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_fallback_on_graph_break_complicated_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_fallback_on_graph_break_simple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_fallback_on_python_primitives_output_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_flat_list_output_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_fn_with_kwargs_in_torch_ops_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_freevars_as_inputs_to_wrap_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_grad_source_fn_stack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_hooks_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_inlined_functions_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_internal_nonlocal_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_lift_tensor_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_make_closure_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_lowers_to_graph_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_multi_return_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_pytree_return_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_side_effect_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_source_fn_stack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_subgraph_name_is_valid_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_map_symint_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_modules_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_nested_tuple_output_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_nested_wrap_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_no_freevars_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_output_with_dict_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_return_captured_var_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_return_captured_var_used_multiple_times_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_return_captured_vars_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_same_freevar_twice_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_del_existing_attr_global_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_del_existing_attr_global_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_del_existing_attr_nonlocal_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_del_existing_attr_nonlocal_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_in_body_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_local_list_append_no_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_global_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_global_num_builtin_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_global_num_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_global_tensor_builtin_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_global_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_nonlocal_num_builtin_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_nonlocal_num_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_nonlocal_tensor_builtin_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_mutate_nonlocal_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_nested_nonlocal_list_append_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_nonlocal_list_append_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_existing_attr_global_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_existing_attr_global_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_existing_attr_nonlocal_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_existing_attr_nonlocal_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_new_attr_global_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_new_attr_global_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_new_attr_nonlocal_module_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_side_effect_set_new_attr_nonlocal_obj_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_symint_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_vmap_source_fn_stack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_all_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_allow_local_assign_in_body_fn_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_default_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_default_else_branch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_default_if_branch_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_int_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_only_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_kwarg_recompile_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_pytree_args_nested_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_pytree_args_not_const_symint_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_pytree_args_with_symint_constant_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_pytree_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_source_fn_stack_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesHigherOrderOpTests::test_wrap_subgraph_name_is_valid_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_capture_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_closure_scalar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_disable_capture_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_fn_with_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_freevar_python_scalar_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_freevar_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_has_aux_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_non_tensor_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_over_grad_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_pytree_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_two_tensor_all_grad_has_aux_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_two_tensor_has_aux_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_with_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_grad_with_side_effect_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_disable_capture_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_free_const_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_free_tensor_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_get_wrapped_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_kwargs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_multiple_invocation_in_dims_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_multiple_invocation_out_dims_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_multiple_outputs_diff_dims_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_multiple_outputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_multiple_outputs_out_dims_tuple_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_new_tensor_implicit_via_op_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_new_tensor_in_body_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_new_tensor_unused_in_body_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_over_vmap_captured_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_over_vmap_two_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_previous_illegal_op_no_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_pytree_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_side_effects_append_input_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_side_effects_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_two_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_two_inputs_tuple_in_dims_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_with_conditional_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_with_graph_break_2_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_with_graph_break_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesFuncTorchHigherOrderOpTests::test_vmap_with_graph_break_lambda_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_LSTM_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_alias_inputs_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_aot_autograd_raises_invalid_leaf_set_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_aot_export_joint_simple_repro_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_aot_grad_mode_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_aot_sequence_nr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_arg_dupe_via_dynamo_recompiles_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_arg_dupe_via_dynamo_recompiles_many_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_arg_dupe_via_dynamo_recompiles_many_args_param_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_arg_dupe_via_dynamo_recompiles_many_args_param_non_tensor_arg_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_arg_dupe_via_dynamo_recompiles_many_args_param_non_tensor_arg_list_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_arg_dupe_via_dynamo_recompiles_many_with_global_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_call_fn_with_non_const_inputs_aot_safe_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_call_fn_with_non_const_inputs_aot_unsafe_control_flow_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_call_fn_with_non_const_inputs_aot_unsafe_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_double_backward_errors_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_eager_sequence_nr_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_multiple_aot_autograd_calls_dupe_args_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_mutation1_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_negative_testing_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_negative_testing_mutation_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_nn_parameter_construction_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_requires_grad_fake_via_dynamo_recompiles_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_split_with_sizes_aot_autograd_cleans_up_traceback_meta_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesTestSDPA::test_graph_break_SDPAParams_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesTestSDPA::test_input_SDPAParams_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesTestSDPA::test_intermediate_attr_access_SDPAParams_dynamic_shapes, test/dynamo/test_dynamic_shapes.py::DynamicShapesTestSDPA::test_returns_SDPAParams_dynamic_shapes 2024-01-31T22:43:27.2578890Z 2024-01-31T22:43:27.2579373Z Running functorch/test_ops 3/6 ... [2024-01-31 22:43:27.125598] 2024-01-31T22:43:27.2581088Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_ops.py', '--shard-id=3', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 22:43:27.125788] 2024-01-31T22:56:08.3147393Z 2024-01-31T22:56:08.3148654Z functorch/test_ops 3/6 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_ops_3.6_a13071104b9c19b1_.log 2024-01-31T22:56:08.3943568Z Running 1592 items in this shard: test/functorch/test_ops.py::TestOperatorsCUDA::test_data_write_errors_under_transform_cuda, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_NumpyExpMarkDirtyAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_ScaleGradGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad__softmax_backward_data_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_angle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_atleast_1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_bfloat16_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_constant_pad_nd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_corrcoef_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_cummin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_deg2rad_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_diag_embed_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_dsplit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_empty_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_exp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_fft_ifft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_fft_ihfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_fft_irfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_fft_rfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_float_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_fmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_frac_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_geometric_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_gradient_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_grid_sampler_2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_index_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_index_select_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_kron_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_ldexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_lgamma_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_cholesky_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_det_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_eigh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_householder_product_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_ldl_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_lu_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_matrix_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_slogdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linalg_tensorinv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_linspace_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_log2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_lt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_masked_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_masked_logaddexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_masked_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_masked_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_max_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_native_dropout_backward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nextafter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_adaptive_avg_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_avg_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_batch_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_batch_norm_without_cudnn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_conv2d_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_conv2d_stride_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_conv2d_stride_padding_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_conv_transpose3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_dropout2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_embedding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_embedding_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_hardswish_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_interpolate_linear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_multi_head_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_relu6_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_nn_functional_triplet_margin_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_ops_aten_index_put_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_polygamma_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_polygamma_polygamma_n_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_repeat_interleave_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_resize__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_resize_as__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_roll_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_round_decimals_neg_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_select_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_sigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_signal_windows_bartlett_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_signal_windows_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_sinc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_bessel_j1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_erfcx_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_i1e_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_log_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_modified_bessel_i0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_special_modified_bessel_k1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_sqrt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_square_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_take_along_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_trunc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_unfold_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_uniform_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_var_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_view_as_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_view_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_grad_vstack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_SelectGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp___getitem___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_allclose_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_bool_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_broadcast_tensors_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_cartesian_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_chalf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_char_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_clone_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_cos_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_deg2rad_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_diagonal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_erf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_erfc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_fft_ihfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_fft_irfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_flipud_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_float_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_float_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_half_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_heaviside_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_item_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_jiterator_binary_return_by_ref_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_jiterator_unary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_kthvalue_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_ldexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_det_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_inv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_ldl_factor_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_ldl_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_lstsq_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_lu_factor_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_matrix_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_pinv_hermitian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_pinv_singular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linalg_vander_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_linspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_logspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_long_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_lt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_lu_unpack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_masked_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_masked_argmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_masked_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_masked_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_max_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_meshgrid_variadic_tensors_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_movedim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nan_to_num_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_narrow_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_native_dropout_backward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_neg_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_new_empty_strided_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_conv2d_stride_depthwise_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_conv2d_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_conv3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_conv_transpose3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_interpolate_linear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_max_unpool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_max_unpool2d_grad_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_head_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_margin_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multilabel_margin_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_normalize_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_pad_circular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_pixel_unshuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_relu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_softmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_triplet_margin_with_distance_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nonzero_static_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_normal_number_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_polar_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_polygamma_polygamma_n_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_polygamma_polygamma_n_4_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_positive_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_rand_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_randn_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_remainder_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_round_decimals_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_round_decimals_neg_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_rsub_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_scatter_reduce_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_scatter_reduce_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_signal_windows_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_signal_windows_general_hamming_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_sinh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_slice_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_sparse_sampled_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_chebyshev_polynomial_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_hermite_polynomial_h_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_hermite_polynomial_he_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_i0e_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_i1e_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_log_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_modified_bessel_i1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_shifted_chebyshev_polynomial_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_special_shifted_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_take_along_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_tensordot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_tile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_topk_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_torch_ops_aten__efficient_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_trace_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_triu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_true_divide_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_unbind_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_unfold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_unsqueeze_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_var_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_where_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_xlogy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvp_zero__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpjvpvmap_NumpyMulAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpjvpvmap_ZeroGradientsGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_ForwardHasDefaultArgsAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_MulGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_NumpyCubeNotComposableAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_SelectAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_T_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp___rdiv___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp___rmatmul___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp__softmax_backward_data_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp__upsample_bilinear2d_aa_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_acos_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_allclose_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_argmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_as_strided_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_atleast_2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_bfloat16_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_byte_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_cfloat_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_clamp_max_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_cummax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_cumulative_trapezoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_diagonal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_dist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_empty_strided_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_eq_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_exp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_exp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_eye_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_fft_fft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_fft_hfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_fft_ifftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_fft_irfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_flatten_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_flip_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_fmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_index_reduce_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_item_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_jiterator_4inputs_with_extra_args_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_kthvalue_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_lgamma_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_eigvalsh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_inv_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_ldl_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_lu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_norm_subgradients_at_zero_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_vander_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linalg_vecdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_linspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_log_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_logdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_logit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_masked_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_masked_softmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_max_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_min_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_movedim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_mvlgamma_mvlgamma_p_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_new_empty_strided_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_avg_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_binary_cross_entropy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_conv2d_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_conv2d_strided_padding_dilation_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_dropout3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_fractional_max_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_group_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_hardshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_instance_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_interpolate_area_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_max_unpool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_pad_constant_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_pairwise_distance_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_pdist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_pixel_shuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_rrelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_silu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_softshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_threshold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_unfold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_normal_in_place_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_ones_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_pca_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_permute_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_polygamma_polygamma_n_1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_quantile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_randn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_resize_as__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_resolve_conj_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_rsub_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_select_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_signal_windows_kaiser_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_sparse_mm_reduce_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_bessel_j0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_chebyshev_polynomial_v_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_entr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_erfcx_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_i0e_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_shifted_chebyshev_polynomial_v_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_special_shifted_chebyshev_polynomial_w_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_squeeze_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_sub_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_sum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_svd_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_torch_ops_aten__efficient_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_unsafe_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_vdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjp_where_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjpvmap_NumpySortAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvjpvmap_SortGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvmap_NumpyExpMarkDirtyAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_jvpvmapvmap_NumpySortAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_contiguous_grad_op_vjp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_flatten_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_list_return_dsplit_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_list_return_split_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_list_return_unbind_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_list_return_vsplit_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_mH_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_mT_grad_op_jvp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_permute_grad_op_vjp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_positive_grad_op_vjp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_view_then_inplace_view_grad_op_vjp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_CubeGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_MulGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_NumpyCubeNotComposableAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_NumpySortAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp___rdiv___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_addbmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_addcmul_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_argmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_broadcast_shapes_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_broadcast_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_bucketize_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_byte_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_cfloat_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_char_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_combinations_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_conj_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_dist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_div_trunc_rounding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_dstack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_empty_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_empty_permuted_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_equal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_erfinv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_expand_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_fft_ifftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_fft_ifftshift_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_fft_rfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_fill_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_floor_divide_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_histc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_igamma_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_inner_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_int_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_isfinite_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_isin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_isinf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_jiterator_unary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_kron_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_lstsq_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_lu_factor_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_matrix_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_pinv_singular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_slogdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_tensorsolve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linalg_vecdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_linspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_log_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_logaddexp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_logical_xor_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_masked_logsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_masked_normalize_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_masked_softmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_masked_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_masked_sum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_max_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_multinomial_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nanquantile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_native_batch_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_ne_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_new_zeros_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_adaptive_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_conv2d_stride_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_conv2d_stride_padding_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_conv2d_stride_padding_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_conv2d_stride_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_conv2d_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_hardshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_kl_div_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_max_pool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_max_unpool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_max_unpool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_mse_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_mse_loss_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_pairwise_distance_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_pdist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_pixel_shuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_pixel_unshuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_scaled_dot_product_attention_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_selu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_tanhshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_nn_functional_unfold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_normal_in_place_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_permute_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_pinverse_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_polygamma_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_polygamma_polygamma_n_2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_rand_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_randint_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_repeat_interleave_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_rot90_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_scatter_reduce_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_sigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_sign_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_signal_windows_general_cosine_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_sparse_mm_reduce_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_special_bessel_j1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_special_i1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_special_log_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_special_modified_bessel_k1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_special_scaled_modified_bessel_k0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_square_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_take_along_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_tensor_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_trapz_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_xlogy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjp_zeros_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_NumpyTakeAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp___radd___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp___rpow___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp__segment_reduce_lengths_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp__segment_reduce_offsets_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp__softmax_backward_data_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_add_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_addcdiv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_addmv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_arange_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_argmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_argsort_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_bool_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_char_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_cummin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_diff_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_fft_rfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_fmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_fmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_full_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_i0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_igammac_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_isclose_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_isreal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_jiterator_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_jiterator_unary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_ldexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_lerp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_linalg_eig_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_linalg_householder_product_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_linalg_matrix_rank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_linalg_solve_triangular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_linalg_svd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_linalg_vander_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_logdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_logical_xor_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_lu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_masked_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_masked_argmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_masked_log_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_masked_logaddexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_masked_logsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_masked_select_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_maximum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_meshgrid_variadic_tensors_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_mvlgamma_mvlgamma_p_5_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_native_batch_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_ne_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_new_empty_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_new_ones_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_alpha_dropout_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_conv2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_conv2d_stride_padding_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_conv2d_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_conv_transpose2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_cross_entropy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_embedding_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_fractional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_interpolate_trilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_local_response_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_logsigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_max_unpool1d_grad_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_max_unpool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_multilabel_soft_margin_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_pad_circular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_pad_constant_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_pad_reflect_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_prelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_scaled_dot_product_attention_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_nn_functional_upsample_nearest_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_norm_nuc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_normal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_ops_aten__new_zeros_with_same_feature_meta_functorchonly_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_positive_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_put_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_quantile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_rand_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_real_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_resize_as__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_round_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_sgn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_signal_windows_gaussian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_sin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_slice_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_softmax_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_sparse_mm_reduce_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_special_chebyshev_polynomial_w_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_special_laguerre_polynomial_l_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_special_log_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_special_modified_bessel_i0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_special_modified_bessel_k0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_special_shifted_chebyshev_polynomial_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_split_list_args_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_std_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_take_along_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_tile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_trunc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjp_var_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjpvmap_ForwardHasDefaultArgsAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjpvmap_NumpyCubeNotComposableAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvjpvmap_NumpyExpMarkDirtyAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_MulGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_NumpyMulAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_SortGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap___rmod___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap___rsub___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap__native_batch_norm_legit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap__segment_reduce_offsets_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_addmv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_aminmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_angle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_arange_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_asinh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_bfloat16_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_bool_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_byte_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_cdouble_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_ceil_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_clamp_min_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_conj_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_constant_pad_nd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_cos_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_double_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_empty_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_exp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_eye_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_fft_hfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_fft_ifft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_fft_irfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_fliplr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_float_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_index_add_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_index_fill_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_isclose_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_isnan_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_lerp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_cholesky_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_cond_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_det_singular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_lstsq_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_matrix_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_slogdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linalg_svd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linspace_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_linspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_logaddexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_logcumsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_logical_or_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_logsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_long_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_masked_fill_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_max_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_maximum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_mode_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_mvlgamma_mvlgamma_p_5_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_narrow_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_adaptive_avg_pool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_adaptive_max_pool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_avg_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_conv1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_conv2d_stride_padding_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_conv_transpose3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_cosine_embedding_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_ctc_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_dropout2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_fractional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_gaussian_nll_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_gelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_hardsigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_huber_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_interpolate_nearest-exact_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_multilabel_margin_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_pad_reflect_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_pixel_unshuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_scaled_dot_product_attention_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_smooth_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_nonzero_static_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_norm_nuc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_pca_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_rad2deg_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_repeat_interleave_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_select_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_sgn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_signal_windows_gaussian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_signal_windows_general_cosine_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_signal_windows_hamming_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_sparse_sampled_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_special_airy_ai_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_special_legendre_polynomial_p_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_special_modified_bessel_k1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_special_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_special_ndtri_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_split_list_args_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_squeeze_multiple_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_std_unbiased_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_uniform_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_unique_consecutive_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_var_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmap_zero__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vjpvmapvmap_ZeroGradientsGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_CubeGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_H_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_H_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_NumpySortAutogradFunction_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_SelectAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad___getitem___functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad___rmatmul___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad___rmatmul___cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad___rmul___cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad___rsub___cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad__segment_reduce_lengths_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad__softmax_backward_data_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad__upsample_bilinear2d_aa_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad__upsample_bilinear2d_aa_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_acos_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_addbmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_addcdiv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_addcmul_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_addmm_decomposed_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_allclose_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_arange_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_as_strided_partial_views_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_as_strided_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_asinh_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_atan2_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_atan_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_bernoulli_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_bmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_bool_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_broadcast_to_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_byte_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_byte_functorch_no_channels_last_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cat_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cat_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cauchy_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_char_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_clamp_min_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_clamp_min_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_combinations_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_complex_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_conj_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_conj_physical_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_constant_pad_nd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_contiguous_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_corrcoef_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_corrcoef_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cos_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cosh_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_count_nonzero_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cummin_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cumsum_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_cumulative_trapezoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_deg2rad_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_diagonal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_div_floor_rounding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_div_trunc_rounding_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_einsum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_empty_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_empty_permuted_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_erf_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_expm1_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_fft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_fft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_fftshift_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_hfft2_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_hfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_ifft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_irfft2_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_rfft_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fft_rfftn_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_fill_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_flip_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_float_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_frac_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_ge_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_heaviside_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_hstack_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_igamma_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_igammac_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_index_add_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_index_put_functorch_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_inner_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_int_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_le_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_lerp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_lgamma_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_cholesky_ex_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_cond_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_cross_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_det_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_eigvals_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_inv_ex_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_ldl_factor_ex_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_lstsq_grad_oriented_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_lu_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_lu_factor_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_lu_factor_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_lu_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_matrix_norm_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_matrix_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_matrix_rank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_matrix_rank_hermitian_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_pinv_singular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_slogdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_solve_ex_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linalg_solve_triangular_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_linspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_log_normal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_logaddexp_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_logical_and_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_logical_not_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_logspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_logsumexp_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_long_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_long_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_argmax_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_log_softmax_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_logsumexp_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_softmax_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_softmin_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_masked_var_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_matrix_exp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_max_pool2d_with_indices_backward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_min_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_min_reduction_with_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_min_reduction_with_dim_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_mm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_movedim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_mv_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_mvlgamma_mvlgamma_p_1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_mvlgamma_mvlgamma_p_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_native_dropout_backward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_native_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_ne_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_new_empty_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_new_empty_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_new_full_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_new_ones_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nextafter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_adaptive_max_pool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_adaptive_max_pool1d_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_batch_norm_without_cudnn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_bilinear_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_binary_cross_entropy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_binary_cross_entropy_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_celu_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_conv2d_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_conv_transpose2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_dropout2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_embedding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_embedding_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_hardsigmoid_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_huber_loss_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_interpolate_linear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_local_response_norm_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_logsigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_max_pool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_max_pool1d_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_max_pool2d_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_max_unpool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_max_unpool3d_grad_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_mse_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_multi_head_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_pad_reflect_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_pairwise_distance_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_poisson_nll_loss_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_relu6_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_relu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_rrelu_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_smooth_l1_loss_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_softmin_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_softshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_softshrink_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_softsign_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_tanhshrink_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_threshold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_nn_functional_upsample_bilinear_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_norm_fro_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_norm_nuc_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_normal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_normal_number_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_ones_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_polygamma_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_polygamma_polygamma_n_2_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_pow_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_rand_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_randint_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_randint_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_randn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_reshape_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_resolve_conj_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_round_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_round_decimals_0_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_round_decimals_neg_3_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_scalar_tensor_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_scatter_add_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_scatter_reduce_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_scatter_reduce_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_searchsorted_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_sgn_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_sigmoid_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_sign_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_sinc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_sinh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_slice_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_bessel_j1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_bessel_y1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_chebyshev_polynomial_t_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_chebyshev_polynomial_u_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_entr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_erfcx_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_i0e_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_i1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_i1e_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_scaled_modified_bessel_k0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_shifted_chebyshev_polynomial_t_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_shifted_chebyshev_polynomial_v_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_special_xlog1py_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_sqrt_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_squeeze_multiple_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_squeeze_multiple_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_stack_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_std_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_std_unbiased_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_stft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_svd_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_to_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_to_sparse_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_to_sparse_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_unfold_copy_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_uniform_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_unsafe_chunk_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_unsqueeze_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_view_as_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmap_autograd_grad_zeros_cuda_float64, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_H_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_NumpyTakeAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_SortGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall___rmul___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall__segment_reduce_offsets_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall__softmax_backward_data_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall__upsample_bilinear2d_aa_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_atan2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_baddbmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_cdist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_clamp_max_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_combinations_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_conj_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_contiguous_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_cumprod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_div_floor_rounding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_eq_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_erf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_erfinv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_exp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_eye_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fft_fftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fft_ifft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fft_irfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fft_irfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fft_rfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fliplr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_float_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_floor_divide_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_fmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_frexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_NumpyExpMarkDirtyAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_ScaleGradGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule___getitem___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule__native_batch_norm_legit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule__upsample_bilinear2d_aa_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_abs_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_asinh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_broadcast_shapes_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_byte_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_combinations_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_conj_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_count_nonzero_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_cumprod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_diag_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_fft_rfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_fft_rfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_gt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_half_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_hsplit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_hypot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_i0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_isinf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_isnan_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_item_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_jiterator_unary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_cond_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_cross_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_eigvals_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_inv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_inv_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_matrix_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linalg_pinv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_linspace_tensor_overload_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_logaddexp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_logical_and_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_logspace_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_masked_argmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_masked_log_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_masked_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_max_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_min_reduction_with_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_ne_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_neg_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_batch_norm_without_cudnn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv2d_stride_depthwise_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv2d_stride_padding_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv2d_strided_padding_dilation_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv_transpose1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv_transpose2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_conv_transpose3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_gelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_leaky_relu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_pad_replicate_negative_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_pairwise_distance_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_pixel_unshuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_nn_functional_prelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_norm_fro_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_ones_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_polygamma_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_quantile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_randn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_roll_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_rot90_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_round_decimals_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_rsqrt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_scatter_reduce_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_scatter_reduce_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_select_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_sin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_sinh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_special_airy_ai_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_special_erfcx_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_sqrt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_square_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_sum_to_size_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_take_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_tan_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_tile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_trapz_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_unfold_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_var_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_zero__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_zeros_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_has_batch_rule_zeros_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_heaviside_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_igammac_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_index_fill_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_index_put_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_index_select_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_inner_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_int_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_int_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_isclose_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_jiterator_4inputs_with_extra_args_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_jiterator_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_ldexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_diagonal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_eigh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_inv_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_ldl_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_lu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_matrix_rank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_linalg_vecdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_log2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_logdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_long_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_lt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_lu_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_masked_fill_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_meshgrid_variadic_tensors_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_min_reduction_with_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_mul_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_mvlgamma_mvlgamma_p_5_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_neg_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_adaptive_avg_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_adaptive_avg_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_celu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_conv1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_conv2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_conv2d_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_conv2d_stride_depthwise_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_conv2d_stride_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_conv2d_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_dropout3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_embedding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_fractional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_gelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_interpolate_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_interpolate_linear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_max_unpool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_max_unpool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_multi_head_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_poisson_nll_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_rrelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_norm_fro_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_norm_inf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_ones_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_ones_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_ormqr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_pca_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_pinverse_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_polygamma_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_polygamma_polygamma_n_2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_polygamma_polygamma_n_3_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_reciprocal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_repeat_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_scatter_reduce_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_scatter_reduce_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_scatter_reduce_sum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_searchsorted_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_short_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_signal_windows_blackman_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_signal_windows_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_signal_windows_general_hamming_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_signal_windows_hann_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_signal_windows_kaiser_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_sinc_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_softmax_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_bessel_j1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_chebyshev_polynomial_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_erfcx_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_hermite_polynomial_h_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_polygamma_special_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_special_shifted_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_std_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_take_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_tensor_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_topk_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_trace_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_triu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_unfold_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_vstack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_xlogy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_zero__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpall_zeros_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_MulGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_NumpyMulAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_SortGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_T_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp___rmatmul___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp__softmax_backward_data_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp__upsample_bilinear2d_aa_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_addcdiv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_addmv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_aminmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_arange_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_argwhere_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_as_strided_partial_views_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_as_strided_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_atan_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_atanh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_atleast_1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_atleast_3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_cdist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_ceil_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_chalf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_cholesky_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_cos_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_count_nonzero_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_cumulative_trapezoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_diagonal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_diff_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_dist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_div_floor_rounding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_empty_strided_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_exp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_fft_fft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_fft_ihfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_fft_irfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_fliplr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_float_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_floor_divide_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_fmod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_index_put_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_index_reduce_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_jiterator_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_jiterator_binary_return_by_ref_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_det_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_eigh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_inv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_lstsq_grad_oriented_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_lu_factor_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_norm_subgradients_at_zero_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_pinv_hermitian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_qr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_linalg_vecdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_log10_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_log_softmax_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_logaddexp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_logical_or_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_long_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_masked_cumsum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_masked_logsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_masked_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_masked_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_max_reduction_with_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_maximum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_new_ones_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_alpha_dropout_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_conv2d_stride_padding_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_conv_transpose3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_embedding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_grid_sample_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_hardshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_hardsigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_hardtanh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_interpolate_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_kl_div_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_local_response_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_max_pool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_max_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_max_unpool1d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_multi_head_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_scaled_dot_product_attention_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_smooth_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_threshold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nn_functional_upsample_nearest_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_nonzero_static_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_ormqr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_polar_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_quantile_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_remainder_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_renorm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_repeat_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_scatter_reduce_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_signal_windows_general_cosine_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_signal_windows_kaiser_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_special_bessel_j0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_special_chebyshev_polynomial_v_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_special_legendre_polynomial_p_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_special_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_special_scaled_modified_bessel_k0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_special_shifted_chebyshev_polynomial_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_split_list_args_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_stft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_sum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_svd_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_tan_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_tanh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_trapezoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_tril_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_true_divide_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_unfold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_view_as_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvmap_ScaleGradGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_ForwardHasDefaultArgsAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_acos_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_addcmul_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_all_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_argmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_as_strided_partial_views_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_asin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_atleast_2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_bernoulli_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_bfloat16_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_broadcast_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_byte_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_cartesian_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_cholesky_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_chunk_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_clamp_min_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_column_stack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_constant_pad_nd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_cosh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_cumprod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_expand_as_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_exponential_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_fft_fft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_fft_hfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_fft_hfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_fft_ifftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_flip_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_fmod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_gradient_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_half_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_NumpyCubeNotComposableAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_ScaleGradGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule__segment_reduce_offsets_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_addcdiv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_addr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_any_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_as_strided_partial_views_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_block_diag_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_bmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_bucketize_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_cauchy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_chunk_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_clamp_min_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_combinations_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_cosh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_double_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_dsplit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_erf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_exp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_eye_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_fft_fftshift_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_fft_hfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_fft_ifft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_fft_ihfftn_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_fft_rfft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_ge_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_hypot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_igamma_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_index_put_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_index_put_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_isinf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_isneginf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_isreal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_jiterator_2inputs_2outputs_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_jiterator_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_kthvalue_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_cholesky_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_ldl_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_lstsq_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_solve_ex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_solve_triangular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_vander_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linalg_vecdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_linspace_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_log10_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_log_softmax_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_logaddexp2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_logdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_lu_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_masked_log_softmax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_masked_normalize_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_masked_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_max_binary_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_meshgrid_list_of_tensors_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_minimum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_mm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_mv_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nansum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_narrow_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_native_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_avg_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_conv2d_stride_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_conv2d_strided_padding_dilation_with_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_conv_transpose2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_cosine_embedding_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_cosine_similarity_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_ctc_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_embedding_functorch_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_grid_sample_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_hardshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_hardsigmoid_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_hardswish_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_hinge_embedding_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_huber_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_interpolate_trilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_leaky_relu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_linear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_max_unpool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_mse_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_multi_head_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_pad_circular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_nn_functional_tanhshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_norm_inf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_permute_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_pinverse_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_polygamma_polygamma_n_0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_randint_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_scatter_reduce_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_select_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_sign_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_signal_windows_gaussian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_signal_windows_kaiser_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_signbit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_softmax_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_sparse_sampled_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_bessel_y0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_modified_bessel_i0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_ndtri_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_scaled_modified_bessel_k1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_shifted_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_special_spherical_bessel_j0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_stack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_std_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_take_along_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_tan_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_tensor_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_topk_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_torch_ops_aten__efficient_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_triu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_uniform_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_var_mean_unbiased_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_has_batch_rule_vstack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_hsplit_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_hstack_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_hypot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_index_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_int_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_isneginf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_isreal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_item_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_det_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_det_singular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_eigvalsh_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_matrix_rank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_multi_dot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_linalg_pinv_hermitian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_logical_xor_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_lt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_masked_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_masked_argmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_masked_logaddexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_masked_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_masked_prod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_masked_softmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_mean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_median_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_meshgrid_list_of_tensors_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_msort_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_mvlgamma_mvlgamma_p_5_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nansum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_ne_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_adaptive_max_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_conv2d_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_conv_transpose2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_fractional_max_pool2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_interpolate_bicubic_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_layer_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_local_response_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_max_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_nll_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_pad_replicate_negative_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_nn_functional_triplet_margin_with_distance_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_pca_lowrank_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_permute_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_positive_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_put_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_qr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_randint_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_randn_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_reciprocal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_resize_as__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_rot90_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_scatter_reduce_amax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_signal_windows_blackman_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_signal_windows_gaussian_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_signal_windows_hann_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_chebyshev_polynomial_w_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_hermite_polynomial_he_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_i1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_scaled_modified_bessel_k0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_shifted_chebyshev_polynomial_t_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_shifted_chebyshev_polynomial_u_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_special_shifted_chebyshev_polynomial_w_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_sub_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_tensor_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_torch_ops_aten__efficient_attention_forward_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_unfold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_unique_consecutive_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_vdot_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_where_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_zero__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjp_zeros_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_SelectGenVmapAutogradFunction_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp___rsub___cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp__segment_reduce_lengths_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_addbmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_addmm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_amin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_any_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_argmin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_argsort_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_asin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_block_diag_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_chalf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_char_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_complex_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_contiguous_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_cummax_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_cummin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_diff_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_einsum_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_empty_strided_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_erf_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_exp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_expand_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_fft_fft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_fft_ifft_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_fft_ihfft2_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_fmod_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_half_functorch_no_channels_last_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_i0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_igammac_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_isfinite_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_jiterator_4inputs_with_extra_args_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_cholesky_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_det_singular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_lu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_lu_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_matrix_norm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_matrix_power_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_solve_triangular_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_linalg_svdvals_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_log10_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_log_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_log_softmax_with_dtype_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_logdet_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_logsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_lu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_masked_fill_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_masked_logsumexp_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_masked_select_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_max_reduction_no_dim_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_mm_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nanmean_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_narrow_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_ne_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_new_zeros_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nextafter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_adaptive_avg_pool3d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_binary_cross_entropy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_conv2d_no_bias_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_dropout2d_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_dropout_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_embedding_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_gelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_hardshrink_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_hinge_embedding_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_interpolate_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_margin_ranking_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_pdist_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_pixel_unshuffle_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_poisson_nll_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_rrelu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_scaled_dot_product_attention_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_silu_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_smooth_l1_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_triplet_margin_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_triplet_margin_with_distance_loss_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_unfold_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_nn_functional_upsample_bilinear_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_normal_in_place_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_permute_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_polygamma_polygamma_n_1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_put_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_randint_like_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_reciprocal_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_resize__cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_roll_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_rsqrt_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_select_scatter_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_sign_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_signal_windows_bartlett_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_sin_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_slice_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_special_bessel_y0_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_special_modified_bessel_k1_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_special_ndtr_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_special_ndtri_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_special_shifted_chebyshev_polynomial_w_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_special_xlog1py_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_split_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_squeeze_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_std_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_sum_to_size_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_svd_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_to_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_trapz_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_triangular_solve_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_var_mean_unbiased_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvjp_view_copy_cuda_float32, test/functorch/test_ops.py::TestOperatorsCUDA::test_vmapvjpvmap_ZeroGradientsGenVmapAutogradFunction_cuda_float32 2024-01-31T22:56:08.4641099Z 2024-01-31T22:56:08.4641560Z Running test_ops_gradients 1/10 ... [2024-01-31 22:56:08.317544] 2024-01-31T22:56:08.4643309Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_gradients.py', '--shard-id=1', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 22:56:08.317767] 2024-01-31T23:06:04.5116860Z 2024-01-31T23:06:04.5118354Z test_ops_gradients 1/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_gradients_1.10_c6d65a74233a03fa_.log 2024-01-31T23:06:04.5417808Z Running 527 items in this shard: test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_H_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_NumpySplitCopyWithIntCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad___rmod___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad___rmul___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad__segment_reduce_lengths_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_addmm_decomposed_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_as_strided_partial_views_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_atleast_2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_bfloat16_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_broadcast_tensors_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cdouble_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ceil_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_chalf_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_column_stack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_count_nonzero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_count_nonzero_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cross_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_deg2rad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_div_no_rounding_mode_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_dot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_double_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_empty_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_expand_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_fftshift_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_irfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_irfftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_rfft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fill_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_flipud_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_float_power_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ge_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_gradient_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_grid_sampler_2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_gt_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_i0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_igamma_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_index_put_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_jiterator_binary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_cholesky_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_cross_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_det_singular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_diagonal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_diagonal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_householder_product_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_inv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_lu_factor_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_pinv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_solve_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_tensorsolve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_log1p_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_log2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_log_softmax_with_dtype_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_logcumsumexp_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_logdet_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_logspace_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_lu_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_logaddexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_matmul_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_min_reduction_no_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_mvlgamma_mvlgamma_p_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_narrow_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_new_full_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_new_zeros_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_avg_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_avg_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_conv1d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_conv_transpose1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_dropout3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_feature_alpha_dropout_without_train_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pad_constant_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pad_reflect_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pad_replicate_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pad_replicate_negative_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pixel_shuffle_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_relu6_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_rrelu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_scaled_dot_product_attention_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_smooth_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_softsign_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_upsample_nearest_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_polygamma_polygamma_n_2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_polygamma_polygamma_n_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_pow_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_quantile_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ravel_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_resize_as__cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_resolve_conj_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_rsub_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_scatter_add_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_scatter_reduce_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_signal_windows_bartlett_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sinh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_slice_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_slice_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sparse_sampled_addmm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sparse_sampled_addmm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_bessel_y0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_scaled_modified_bessel_k1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_shifted_chebyshev_polynomial_t_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_shifted_chebyshev_polynomial_v_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_split_list_args_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_split_with_sizes_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_std_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sub_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_t_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_take_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tan_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tanh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tensordot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_trunc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_unfold_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_var_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_view_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_where_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_H_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_NumpyCatCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad___radd___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad___rdiv___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad__segment_reduce_offsets_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addcmul_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addmm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_amax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_atan2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_bucketize_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cauchy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_conj_physical_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cosh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cummax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_diff_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_empty_permuted_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_eq_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_fftn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_fftshift_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_fftshift_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_hfftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_ifftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_ifftshift_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_ihfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_irfft_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_full_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_gradient_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_index_reduce_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_int_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_jiterator_binary_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_le_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_cholesky_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_diagonal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_eigh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_lu_factor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_matrix_rank_hermitian_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_multi_dot_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_multi_dot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_solve_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_solve_triangular_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_svd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_tensorinv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_logcumsumexp_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_logical_or_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_logical_xor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_mT_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_var_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_matmul_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_max_reduction_no_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_narrow_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_ne_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_new_empty_strided_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_avg_pool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_conv2d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_interpolate_area_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_interpolate_trilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_l1_loss_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_layer_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_pad_replicate_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_scaled_dot_product_attention_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_silu_complex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_soft_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_softmin_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_softsign_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_unfold_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_upsample_nearest_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_outer_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_pinverse_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_put_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_rand_like_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_randn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_reciprocal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_renorm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_reshape_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_reshape_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_scatter_add_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_scatter_reduce_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_sgn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_softmax_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_bessel_j1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_shifted_chebyshev_polynomial_t_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_split_with_sizes_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_stft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_take_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unfold_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unique_consecutive_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unsafe_split_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_view_as_complex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_view_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_vstack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_xlogy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_NumpySplitCopyWithIntCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad___getitem___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad___radd___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad___rmatmul___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_addbmm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_argwhere_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_atleast_3d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_bfloat16_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cdist_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cholesky_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cholesky_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_chunk_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_chunk_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_contiguous_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_count_nonzero_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_diagflat_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_diagonal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_diagonal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_div_no_rounding_mode_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_dsplit_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_empty_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_equal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_ifftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_ifftshift_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_irfft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fill_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_full_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_geqrf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_gradient_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_gt_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_hsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_index_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_index_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_kron_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_kron_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_le_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_inv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_ldl_factor_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_ldl_factor_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_ldl_solve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_lu_factor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_matrix_rank_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_matrix_rank_hermitian_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_pinv_singular_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_slogdet_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_svd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_svd_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_svdvals_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_svdvals_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_vecdot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_log1p_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_log2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_log2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_log_softmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_logspace_tensor_overload_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_cumsum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_logsumexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_normalize_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_matmul_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_min_binary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_movedim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_mvlgamma_mvlgamma_p_5_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nansum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_new_ones_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_new_ones_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_bilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_celu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv3d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv_transpose3d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv_transpose3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_dropout3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_group_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_hinge_embedding_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_interpolate_bilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_interpolate_trilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_kl_div_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_max_unpool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_multi_head_attention_forward_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_multilabel_soft_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pad_circular_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pixel_unshuffle_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_smooth_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_softmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_norm_nuc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_normal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_normal_number_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_outer_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_ravel_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_reciprocal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_resolve_conj_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_rot90_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_round_decimals_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_rsub_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_scatter_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_scatter_reduce_prod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_short_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_short_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_signal_windows_gaussian_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_slice_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_chebyshev_polynomial_w_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_shifted_chebyshev_polynomial_w_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_square_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_squeeze_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_stack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_std_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_take_along_dim_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_to_sparse_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_to_sparse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_trace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_triu_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_unbind_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_unfold_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_unsqueeze_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_var_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_var_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_var_unbiased_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_view_as_real_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_vsplit_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_vsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_where_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_NumpySplitCopyWithIntCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad___getitem___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad___rmul___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_as_strided_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_as_strided_scatter_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_atan_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_atleast_1d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_byte_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cdouble_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_char_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cholesky_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cholesky_inverse_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_column_stack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cos_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cumulative_trapezoid_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_diagonal_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_digamma_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_dsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_einsum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_empty_permuted_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_erfinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_expand_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_fft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_fftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_fftshift_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_irfftn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_flip_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_half_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_heaviside_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_i0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_index_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_inner_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_inner_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_isclose_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_jiterator_binary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_jiterator_unary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_kron_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_kthvalue_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_det_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_diagonal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_lu_factor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_lu_factor_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_norm_subgradients_at_zero_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_solve_triangular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_vander_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_vector_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_logical_or_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_lt_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_argmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_log_softmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_prod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_matmul_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_mm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_new_zeros_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_alpha_dropout_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_conv2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_conv_transpose3d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_cosine_similarity_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_hardshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_linear_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_normalize_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_pad_replicate_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_pairwise_distance_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_scaled_dot_product_attention_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_selu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_smooth_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_softplus_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_tanhshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_unfold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_norm_fro_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_norm_fro_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_normal_in_place_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_ones_like_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_polygamma_polygamma_n_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_randint_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_reshape_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_scatter_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_scatter_reduce_sum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_short_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_signal_windows_general_hamming_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_sin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_hermite_polynomial_h_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_i1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_shifted_chebyshev_polynomial_v_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_zeta_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_split_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_stack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_sub_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_svd_lowrank_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_tanh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_unbind_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_view_as_complex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_xlogy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_T_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad___getitem___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad___radd___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad___rmatmul___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad__native_batch_norm_legit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_add_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_addcdiv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_addmv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_allclose_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_atanh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_bernoulli_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_bfloat16_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_chalf_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cholesky_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_chunk_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_contiguous_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_count_nonzero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_diff_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_div_no_rounding_mode_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_div_no_rounding_mode_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_eye_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_fftshift_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_ifftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_ihfft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_histc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_isnan_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_jiterator_4inputs_with_extra_args_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_le_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_diagonal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_eigvals_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_eigvalsh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_lu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_lu_factor_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_lu_solve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_multi_dot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_pinv_singular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_vecdot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_log1p_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_logical_xor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_mT_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_masked_softmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_masked_var_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_max_binary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_meshgrid_variadic_tensors_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_neg_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_binary_cross_entropy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_celu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_conv2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_conv_transpose3d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_huber_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_interpolate_bicubic_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_interpolate_bilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_interpolate_nearest-exact_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_leaky_relu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_max_unpool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_normalize_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_normalize_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pad_constant_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pairwise_distance_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pdist_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pixel_shuffle_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_prelu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_relu6_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_smooth_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_softmin_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_softmin_with_dtype_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_tanhshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nonzero_static_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_norm_nuc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_ones_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_permute_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_polygamma_polygamma_n_2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_polygamma_polygamma_n_4_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_pow_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_pow_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_put_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_qr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_ravel_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_real_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_renorm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_reshape_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_round_decimals_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_round_decimals_neg_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_rsub_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_scatter_reduce_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_bessel_y0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_squeeze_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_squeeze_multiple_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_sub_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_tensor_split_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_tensordot_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_to_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_unflatten_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_unique_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_var_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_var_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_zeros_like_cuda_complex128 2024-01-31T23:06:04.5646320Z 2024-01-31T23:06:04.5646669Z Running test_ops_gradients 5/10 ... [2024-01-31 23:06:04.512841] 2024-01-31T23:06:04.5648359Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_gradients.py', '--shard-id=5', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 23:06:04.513820] 2024-01-31T23:17:10.0288474Z 2024-01-31T23:17:10.0290075Z test_ops_gradients 5/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_gradients_5.10_e99ab14062b89ae5_.log 2024-01-31T23:17:10.0522282Z Running 524 items in this shard: test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_NumpyCubeCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_NumpyMulCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_abs_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_add_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_argmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_argmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_as_strided_partial_views_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_atleast_3d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_bfloat16_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_bool_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cauchy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_clamp_min_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_contiguous_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cos_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cosh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_diag_embed_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_div_trunc_rounding_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_equal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_erfinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_hfft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_ihfft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fliplr_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_half_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_hstack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_imag_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_isclose_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_isin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_jiterator_4inputs_with_extra_args_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_eigvals_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_inv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_inv_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_lu_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_lu_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_matrix_rank_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_matrix_rank_hermitian_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_norm_subgradients_at_zero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_pinv_singular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_qr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linspace_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_log10_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_log_softmax_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_logaddexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_logical_not_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_long_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_mH_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_cumsum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_std_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_var_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_matrix_exp_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_max_pool2d_with_indices_backward_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_meshgrid_list_of_tensors_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_mm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_mvlgamma_mvlgamma_p_1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nansum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_native_batch_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ne_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_new_ones_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_adaptive_avg_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_group_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_huber_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_interpolate_linear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_linear_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_max_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_multilabel_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_soft_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_softmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_unfold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nonzero_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ones_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ones_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_outer_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_permute_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_pinverse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_polygamma_polygamma_n_4_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_ravel_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_resize__cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_round_decimals_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_scalar_tensor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_scatter_add_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_scatter_reduce_sum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sigmoid_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_signal_windows_general_cosine_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sinc_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_softmax_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_airy_ai_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_hermite_polynomial_he_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_legendre_polynomial_p_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_square_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_squeeze_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_squeeze_multiple_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_std_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_std_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_take_along_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tile_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_trace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_transpose_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tril_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_triu_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_unfold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_unique_consecutive_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_vdot_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_view_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_vsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_zeros_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_NestedMapControlflowOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_TripleNestedMapControlflowOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad___rmul___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad___rsub___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addbmm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addmm_decomposed_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_angle_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_argwhere_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_atleast_2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_bernoulli_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_bfloat16_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_bmm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cholesky_inverse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_constant_pad_nd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cumsum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_einsum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_erfc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_erfinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_expm1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_hfft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_hfftn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_ifft_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fill_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_flatten_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_flipud_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_frac_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_ge_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_geqrf_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_hypot_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_igamma_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_index_add_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_index_put_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_isnan_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_isneginf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_isreal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_jiterator_2inputs_2outputs_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_det_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_diagonal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_householder_product_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_ldl_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_lstsq_grad_oriented_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_lu_factor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_matrix_power_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_matrix_power_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_matrix_rank_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_qr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_solve_triangular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_svd_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_log2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_log_normal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_logcumsumexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_logdet_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_amax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_argmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_median_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_sum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_max_binary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_maximum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_meshgrid_list_of_tensors_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_meshgrid_list_of_tensors_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_min_reduction_with_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nansum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_native_layer_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_new_empty_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_adaptive_avg_pool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_batch_norm_without_cudnn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_bilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_ctc_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_feature_alpha_dropout_without_train_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_local_response_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_max_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_mish_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_multilabel_soft_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_softmin_with_dtype_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_triplet_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_norm_nuc_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_norm_nuc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_normal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_normal_in_place_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_outer_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_pinverse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_ravel_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_repeat_interleave_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_resize__cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_round_decimals_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_searchsorted_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_chebyshev_polynomial_v_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_erfcx_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_modified_bessel_i1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_modified_bessel_k0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_polygamma_special_polygamma_n_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_scaled_modified_bessel_k0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_shifted_chebyshev_polynomial_u_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_shifted_chebyshev_polynomial_w_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_split_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_std_mean_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_sum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_svd_lowrank_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_t_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_tan_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_true_divide_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_view_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_vsplit_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_zero__cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_H_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_NumpyCubeCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_NumpyTakeCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_NumpyViewCopyCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad___rmul___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad___rpow___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_arange_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_as_strided_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_as_strided_partial_views_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_asin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_atleast_2d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_block_diag_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cartesian_prod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cholesky_solve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_complex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_conj_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_conj_physical_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_corrcoef_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cos_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cos_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cumsum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_diag_embed_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_diff_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_dist_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_div_floor_rounding_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_div_trunc_rounding_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_exp2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_hfft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_float_power_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_floor_divide_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_gradient_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_half_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_index_fill_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_int_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_isclose_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_cross_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_det_singular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_eig_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_householder_product_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_lu_factor_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_matrix_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_pinv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_pinv_hermitian_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_tensorsolve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_vander_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linspace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_log10_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_logical_xor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_long_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_mT_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_cumprod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_normalize_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_prod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_scatter_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_var_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_narrow_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_ne_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_neg_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_alpha_dropout_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_avg_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_batch_norm_without_cudnn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_dropout2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_feature_alpha_dropout_without_train_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_fractional_max_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_gelu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_grid_sample_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_l1_loss_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_linear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_local_response_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_max_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pad_replicate_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pixel_shuffle_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pixel_unshuffle_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_softmin_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_triplet_margin_with_distance_loss_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_triplet_margin_with_distance_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nonzero_static_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_norm_inf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_norm_nuc_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_ones_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_polygamma_polygamma_n_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_quantile_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_rand_like_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_randn_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_resize__cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_resolve_neg_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_round_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_round_decimals_neg_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_rsqrt_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_sigmoid_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_signal_windows_hann_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_signbit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_sinc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_slice_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_softmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_erfcx_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_shifted_chebyshev_polynomial_u_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_sqrt_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_std_mean_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_t_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_tan_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_tanh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_tensor_split_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_true_divide_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_unique_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_var_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_view_as_complex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_view_as_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_zeros_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_NumpyNMSCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_T_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad__segment_reduce_lengths_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_addcmul_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_addmm_decomposed_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_all_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_atan_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_bmm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cartesian_prod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cfloat_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_char_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_chunk_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_clamp_min_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_clone_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_column_stack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_contiguous_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cumprod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_diagonal_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_empty_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_expand_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_ihfft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_rfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_float_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_floor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_full_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_imag_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_index_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_index_put_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_index_put_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_diagonal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_inv_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_lu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_matrix_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_pinv_hermitian_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_pinv_singular_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linspace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_log1p_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_logspace_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_long_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_lu_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_mT_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_amax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_amin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_cumsum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_median_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_max_pool2d_with_indices_backward_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_meshgrid_variadic_tensors_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_meshgrid_variadic_tensors_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_multinomial_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_avg_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_bilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_conv_transpose2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_feature_alpha_dropout_with_train_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_hardsigmoid_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_interpolate_trilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_kl_div_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_l1_loss_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_pixel_unshuffle_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_silu_complex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_softmin_with_dtype_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nonzero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_norm_inf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_normal_in_place_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_pow_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_randint_like_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_reshape_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_scatter_reduce_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_searchsorted_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_select_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_signal_windows_general_cosine_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_signal_windows_kaiser_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_chebyshev_polynomial_w_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_modified_bessel_i0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_spherical_bessel_j0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_xlog1py_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_svd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_to_sparse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_unsafe_chunk_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_var_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_zero__cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad__softmax_backward_data_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_addbmm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_addcmul_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_amax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_argmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_as_strided_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_atan_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_bool_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cat_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cat_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cfloat_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_char_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cholesky_inverse_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_clamp_min_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_clone_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_column_stack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_column_stack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_complex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_conj_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_corrcoef_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cosh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_count_nonzero_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cov_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cumulative_trapezoid_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_diagonal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_diagonal_scatter_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_div_floor_rounding_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_dsplit_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_einsum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_erfinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_hfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_hfftn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_ifft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_ifft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_ihfftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_flipud_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fmod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_frexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_gather_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_geometric_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_half_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_i0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_inner_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_isfinite_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_item_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_jiterator_unary_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_kron_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_ldexp_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_ldexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_lerp_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_ldl_factor_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_lstsq_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_lu_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_tensorsolve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_vector_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linspace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_log_softmax_with_dtype_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_long_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_masked_argmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_masked_prod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_masked_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_max_reduction_with_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_median_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_meshgrid_list_of_tensors_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_min_reduction_with_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_mul_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_mvlgamma_mvlgamma_p_5_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_narrow_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_narrow_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_new_full_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_adaptive_avg_pool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_avg_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_conv_transpose2d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_embedding_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_fractional_max_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_hardswish_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_hinge_embedding_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_interpolate_trilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_layer_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_max_unpool3d_grad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pad_circular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pad_constant_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_silu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_triplet_margin_loss_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nonzero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nonzero_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_norm_fro_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_outer_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_polar_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_positive_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_prod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_randn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_reciprocal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_renorm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_resize_as__cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_rot90_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_scalar_tensor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_scatter_reduce_prod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_sgn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_sigmoid_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_signal_windows_hamming_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_chebyshev_polynomial_u_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_ndtri_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_split_list_args_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_std_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_std_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_std_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_std_mean_unbiased_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_svd_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_triangular_solve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_unbind_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_unfold_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_unsqueeze_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_var_mean_unbiased_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_vsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_vstack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_vstack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_where_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_zeros_cuda_float64 2024-01-31T23:17:10.0750823Z 2024-01-31T23:17:10.0751281Z Running test_ops_gradients 9/10 ... [2024-01-31 23:17:10.029633] 2024-01-31T23:17:10.0752964Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_gradients.py', '--shard-id=9', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 23:17:10.029831] 2024-01-31T23:26:31.4729142Z 2024-01-31T23:26:31.4731445Z test_ops_gradients 9/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_gradients_9.10_054b80417cab068c_.log 2024-01-31T23:26:31.4995152Z Running 544 items in this shard: test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_NumpyCatCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_T_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad__native_batch_norm_legit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_angle_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_argwhere_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_atan_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_atan_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_atleast_2d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_conj_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_contiguous_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_corrcoef_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cos_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cummin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cumprod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_cumprod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_diagflat_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_dist_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_double_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_equal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_erfc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_fftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_hfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_ihfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_irfft_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_fft_irfftn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_flatten_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_flipud_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_float_power_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_full_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_histc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_inner_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_kron_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_eigvals_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_eigvalsh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_ldl_factor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_ldl_factor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_lstsq_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_matrix_power_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_vander_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linalg_vector_norm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linspace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_linspace_tensor_overload_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_log_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_logspace_tensor_overload_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_cumsum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_masked_scatter_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_max_reduction_no_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_max_reduction_with_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_median_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_meshgrid_variadic_tensors_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_minimum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_mode_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_mul_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_neg_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_adaptive_max_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_conv1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_cosine_similarity_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_feature_alpha_dropout_without_train_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_gelu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_hardswish_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_l1_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_layer_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_linear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pad_reflect_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_pad_replicate_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_relu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_softmin_with_dtype_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_softshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_nn_functional_tanhshrink_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_norm_nuc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_normal_in_place_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_randn_like_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_rot90_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_sgn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_signal_windows_hann_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_signal_windows_nuttall_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_slice_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_chebyshev_polynomial_u_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_special_erfcx_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_split_with_sizes_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_squeeze_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_std_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_svd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_svd_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_t_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tanh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_transpose_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_trapezoid_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_tril_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_unsqueeze_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_var_unbiased_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_zero__cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_fail_gradgrad_zeros_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_NumpyTakeCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad___rmatmul___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad___rpow___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad__softmax_backward_data_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addbmm_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addcdiv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_addr_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_arange_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_as_strided_partial_views_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_atanh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_block_diag_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_byte_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cholesky_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cholesky_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cholesky_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_clamp_min_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cos_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_cummin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_diagonal_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_diagonal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_dist_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_double_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_dsplit_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_einsum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_empty_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_empty_strided_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_exp2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_expand_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_expand_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_fft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_fft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_fft_ihfftn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_floor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_gather_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_geometric_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_grid_sampler_2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_isclose_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_istft_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_jiterator_4inputs_with_extra_args_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_kron_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_lgamma_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_cholesky_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_det_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_eigvals_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_eigvalsh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_inv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_lu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_matrix_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_pinv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_pinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_pinv_hermitian_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_slogdet_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_tensorsolve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_log2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_logical_or_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_logaddexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_normalize_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_masked_softmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_max_reduction_with_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_median_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_mul_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_mv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_narrow_copy_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_narrow_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_new_empty_strided_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_new_ones_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_adaptive_max_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_avg_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_conv3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_conv_transpose3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_cosine_embedding_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_interpolate_linear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_linear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_max_unpool1d_grad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_max_unpool2d_grad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_pixel_shuffle_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nn_functional_triplet_margin_loss_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nonzero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_nonzero_static_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_normal_number_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_ones_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_permute_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_polygamma_polygamma_n_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_positive_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_pow_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_quantile_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_resolve_conj_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_rot90_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_sinh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_sort_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_sparse_mm_reduce_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_chebyshev_polynomial_u_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_xlog1py_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_special_zeta_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_square_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_square_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_squeeze_multiple_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_stack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_std_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_std_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_std_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_stft_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_tanh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_tensor_split_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_to_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unbind_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unflatten_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unfold_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_unfold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_var_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_view_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_view_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_addcdiv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_addcmul_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_addmm_decomposed_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_any_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_argmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_argsort_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_atleast_3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cdouble_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_copysign_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_corrcoef_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cross_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_cummax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_dsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_einsum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_erfc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_expm1_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_fft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_hfft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_ifft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fft_irfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fill_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_flatten_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_flipud_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_fmod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_gather_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_geometric_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_i0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_isfinite_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_isreal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_jiterator_unary_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_cond_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_cross_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_det_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_eigvalsh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_householder_product_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_inv_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_lu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_lu_factor_ex_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_lu_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_multi_dot_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_tensorinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_linalg_vander_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_log_softmax_with_dtype_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_logical_not_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_logspace_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_lu_solve_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_lu_solve_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_argmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_fill_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_logaddexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_select_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_softmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_masked_softmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_max_reduction_with_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nanquantile_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_avg_pool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_binary_cross_entropy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv_transpose1d_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_conv_transpose1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_cosine_embedding_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_feature_alpha_dropout_without_train_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_glu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_hardshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_leaky_relu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_logsigmoid_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_max_unpool1d_grad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_max_unpool2d_grad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_multilabel_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pad_circular_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pad_replicate_negative_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_pairwise_distance_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_prelu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_nn_functional_upsample_bilinear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_ones_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_ones_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_outer_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_permute_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_polygamma_polygamma_n_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_real_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_real_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_remainder_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_repeat_interleave_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_reshape_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_resolve_conj_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_roll_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_rsub_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_searchsorted_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_sgn_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_signal_windows_general_hamming_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_sin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_chebyshev_polynomial_u_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_entr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_ndtr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_special_shifted_chebyshev_polynomial_v_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_split_list_args_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_square_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_stft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_svd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_take_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_trunc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_unfold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_unsafe_chunk_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_view_as_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_where_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_gradgrad_zeros_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_NumpyNonzeroCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_NumpyTakeCustomOp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad___rmul___cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_acos_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_addmm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_allclose_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_argmax_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_as_strided_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_atleast_2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_bfloat16_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_block_diag_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_broadcast_to_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cauchy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cdouble_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_ceil_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cholesky_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cholesky_inverse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_constant_pad_nd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_constant_pad_nd_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_contiguous_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_cumsum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_empty_strided_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_eq_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_equal_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_ifftshift_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fft_ihfft2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fill_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_flatten_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fliplr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_fmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_full_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_geqrf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_gradient_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_histc_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_hstack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_igammac_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_index_add_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_index_select_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_isin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_lgamma_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_cholesky_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_cond_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_cross_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_cross_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_det_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_eigh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_lu_factor_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_norm_subgradients_at_zero_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_qr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_solve_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_svdvals_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_linalg_vecdot_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_logaddexp2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_logcumsumexp_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_logical_and_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_logical_xor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_long_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_scatter_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_masked_std_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_min_reduction_no_dim_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_movedim_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_mvlgamma_mvlgamma_p_1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nansum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nansum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_new_empty_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_new_empty_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_adaptive_avg_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_adaptive_max_pool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_adaptive_max_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_avg_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_conv1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_fractional_max_pool2d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_group_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_interpolate_bicubic_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_interpolate_nearest-exact_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_interpolate_nearest_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_logsigmoid_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_max_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_max_unpool1d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_max_unpool2d_grad_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_multi_margin_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_pad_constant_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_pixel_shuffle_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_poisson_nll_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_softshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_nn_functional_threshold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_normal_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_normal_number_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_ormqr_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_polygamma_polygamma_n_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_polygamma_polygamma_n_2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_prod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_qr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_randn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_randn_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_real_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_remainder_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_repeat_interleave_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_resolve_neg_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_rsqrt_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_scatter_reduce_prod_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_select_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_sinh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_special_scaled_modified_bessel_k0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_split_list_args_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_sqrt_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_stack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_std_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_std_mean_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_std_mean_unbiased_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_std_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_sum_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_tensor_split_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_tile_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_tril_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_unbind_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_var_mean_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_grad_zeros_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad___rdiv___cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_acosh_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_addmm_decomposed_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_addmv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_atan2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_block_diag_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_bool_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cartesian_prod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cauchy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cdouble_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cfloat_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cholesky_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_conj_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cos_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cosh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_cummin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_diag_embed_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_div_trunc_rounding_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_dsplit_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_dstack_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_erf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_expand_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_exponential_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_fft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_fft_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_fftn_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fft_irfft2_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_flipud_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_float_power_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_fmin_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_full_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_geqrf_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_igammac_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_inner_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_isclose_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_jiterator_binary_return_by_ref_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_cholesky_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_cholesky_ex_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_eigvalsh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_ldl_factor_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_ldl_factor_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_pinv_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_pinv_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_slogdet_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_svdvals_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_vecdot_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_linalg_vector_norm_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_log2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_logaddexp2_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_logspace_tensor_overload_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_long_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_lu_unpack_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_masked_prod_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_multinomial_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_mvlgamma_mvlgamma_p_1_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_narrow_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_avg_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_cosine_similarity_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_linear_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_linear_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_max_pool3d_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_multi_head_attention_forward_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pad_replicate_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pad_replicate_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_pad_replicate_negative_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_poisson_nll_loss_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_relu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_selu_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_softshrink_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_nn_functional_unfold_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_pinverse_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_polygamma_polygamma_n_3_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_positive_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_quantile_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_rand_like_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_randn_like_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_repeat_interleave_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_resolve_neg_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_roll_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_scatter_add_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_searchsorted_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_signal_windows_general_hamming_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_sinh_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_chebyshev_polynomial_t_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_chebyshev_polynomial_w_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_entr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_hermite_polynomial_he_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_ndtr_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_polygamma_special_polygamma_n_0_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_shifted_chebyshev_polynomial_w_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_special_xlog1py_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_std_mean_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_std_mean_unbiased_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_stft_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_sum_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_svd_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_tan_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_tan_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_tensor_split_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_tile_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_transpose_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_transpose_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_unfold_copy_cuda_float64, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_view_as_real_cuda_complex128, test/test_ops_gradients.py::TestBwdGradientsCUDA::test_inplace_gradgrad_xlogy_cuda_float64 2024-01-31T23:26:31.5228543Z 2024-01-31T23:26:31.5228874Z Running test_transformers 1/3 ... [2024-01-31 23:26:31.473709] 2024-01-31T23:26:31.5230655Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_transformers.py', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 23:26:31.473903] 2024-01-31T23:36:39.0698049Z 2024-01-31T23:36:39.0699273Z test_transformers 1/3 was successful, full logs can be found in artifacts with path test/test-reports/test_transformers_1.3_bf2b326bc96b49ef_.log 2024-01-31T23:36:41.3049978Z Running 25188 items in this shard: test/test_transformers.py::TestTransformersCUDA::test_bias_is_none_cuda, test/test_transformers.py::TestTransformersCUDA::test_encoder_is_causal_cuda, test/test_transformers.py::TestTransformersCUDA::test_encoder_padding_and_src_mask_bool_cuda, test/test_transformers.py::TestTransformersCUDA::test_mha_native_args_nb_heads_1_bias_False_cuda, test/test_transformers.py::TestTransformersCUDA::test_mha_native_args_nb_heads_1_bias_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_mha_native_args_nb_heads_8_bias_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_multiheadattention_fastpath_attn_mask_attn_mask_dim2_key_padding_mask_dim1_float32_cuda, test/test_transformers.py::TestTransformersCUDA::test_multiheadattention_fastpath_attn_mask_attn_mask_dim_3_key_padding_mask_dim1_float32_cuda, test/test_transformers.py::TestTransformersCUDA::test_multiheadattention_fastpath_attn_mask_attn_mask_dim_3_key_padding_mask_dim_2_bool_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_3D_input_dim_3D_attn_mask_dropout_p_0_0_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_3D_input_dim_3D_attn_mask_dropout_p_0_2_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_2D_attn_mask_dropout_p_0_0_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_2D_attn_mask_dropout_p_0_5_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_4D_attn_mask_dropout_p_0_0_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_4D_causal_attn_mask_dropout_p_0_2_cuda, test/test_transformers.py::TestTransformersCUDA::test_self_attn_TxT_attn_mask_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_batch_first_False_training_False_enable_nested_tensor_False_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_batch_first_False_training_True_enable_nested_tensor_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_batch_first_True_training_False_enable_nested_tensor_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_fastpath_use_torchscript_False_enable_nested_tensor_True_use_autocast_False_d_model_256_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_square_input_with_no_grad_True_training_False_enable_nested_tensor_False_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoderlayer_src_mask_nhead_1_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoderlayer_src_mask_nhead_4_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_flash_autocast_fp32_float16_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_head_dim_kernel0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_head_dim_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_invalid_dtype_kernel0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_1_dimensional_inputs_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_datatypes_kernel0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_datatypes_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_datatypes_kernel2_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_devices_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_sequence_lengths_kernel1_cuda, test/test_transformers.py::TestSDPACUDA::test_fused_sdp_choice_cpu_type_dense_dropout_0_7_bfloat16_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_fused_sdp_choice_cpu_type_nested_dropout_0_0_bfloat16_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_fused_sdp_choice_cpu_type_nested_dropout_0_7_float32_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_12_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_1179_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_2_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_16_mask_dim_4_bool_mask_1_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_1_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_16_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_True_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_True_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_1030_n_head_1_head_dim_16_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_1030_n_head_3_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_1030_n_head_1_head_dim_8_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_1030_n_head_3_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_sdp_math_gradcheck_contiguous_inputs_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_query_dense_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_seq_len_1_inputs_fused_kernel0_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_seq_len_1_inputs_fused_kernel1_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_eff_attention_pad_mask_float16_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attetntion_mask_variants_mask_dim_3_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attetntion_mask_variants_mask_dim_4_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_scaled_dot_product_attention_fused_kernels_packed_accuracy_type_nested_fused_kernel1_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_scaled_dot_product_attention_fused_kernels_packed_type_nested_is_contiguous_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_sdp_flash_attention_grad_against_math_contiguous_inputs_True_is_causal_True_float16_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_sdp_mem_efficient_grad_against_math_contiguous_inputs_True_is_causal_False_cuda, test/test_transformers.py::TestAttnMasksCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda, test/test_transformers.py::TestAttnMasksCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda, test/test_transformers.py::TestAttnMasksCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda, test/test_transformers.py::TestAttnMasksCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda, test/test_transformers.py::TestAttnMasksCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda, test/test_transformers.py::TestAttnMasksCUDA::test_is_causal_equals_upper_left_shape2_cuda 2024-01-31T23:36:43.4396082Z 2024-01-31T23:36:43.4396574Z Running test_ops_fwd_gradients 2/10 ... [2024-01-31 23:36:39.135379] 2024-01-31T23:36:43.4398341Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_fwd_gradients.py', '--shard-id=2', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 23:36:39.135650] 2024-01-31T23:46:19.1252792Z 2024-01-31T23:46:19.1254340Z test_ops_fwd_gradients 2/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_fwd_gradients_2.10_66d64e346e9df668_.log 2024-01-31T23:46:19.1415618Z Running 334 items in this shard: test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_T_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad___rdiv___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_add_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_argmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_atleast_2d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_atleast_3d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_baddbmm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cdouble_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ceil_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_chalf_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_chunk_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_combinations_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_conj_physical_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cos_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cosh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_count_nonzero_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cumulative_trapezoid_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_diagonal_copy_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_div_no_rounding_mode_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_einsum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_exp_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_hfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_ifft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_ifftshift_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_irfft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_irfftn_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_rfftn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_flipud_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_full_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_half_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_index_reduce_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_isreal_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_kron_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ldexp_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_lerp_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_diagonal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_ldl_factor_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_matrix_rank_hermitian_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_pinv_singular_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_slogdet_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_tensorinv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_tensorsolve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_log_softmax_with_dtype_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logaddexp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logical_and_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_long_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_lu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_mH_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_cumsum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_normalize_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_sum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_meshgrid_list_of_tensors_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_new_full_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_new_zeros_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_cosine_embedding_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_dropout3d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_fractional_max_pool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_hardsigmoid_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_huber_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_instance_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_interpolate_nearest_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_interpolate_trilinear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_kl_div_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_multilabel_soft_margin_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_normalize_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_pad_circular_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_tanhshrink_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_threshold_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nonzero_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_normal_number_mean_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ormqr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_pinverse_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_polar_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_polygamma_polygamma_n_0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_pow_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_quantile_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_randn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ravel_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_resize_as__cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_rsqrt_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_rsqrt_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_scalar_tensor_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_short_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_short_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_signal_windows_hamming_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sparse_mm_reduce_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_hermite_polynomial_h_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_modified_bessel_i1_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_split_list_args_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_split_list_args_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_square_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_squeeze_multiple_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_stack_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_std_mean_unbiased_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sum_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_take_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_tile_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_trace_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_trapezoid_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_triangular_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_triu_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unique_consecutive_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unsafe_chunk_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_zeros_like_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_T_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD__upsample_bilinear2d_aa_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_acosh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_addbmm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_addcdiv_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_all_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_aminmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_as_strided_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atan2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atan_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_baddbmm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_block_diag_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_bool_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cartesian_prod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_combinations_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_contiguous_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_diagonal_copy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_diagonal_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_dsplit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_einsum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_empty_like_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_eq_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_equal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_erf_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_exp2_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_expm1_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_fftshift_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_ifftn_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_ifftn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_irfft2_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_rfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_flatten_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_half_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_half_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_heaviside_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_imag_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_index_select_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_isinf_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_isposinf_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_item_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_item_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_jiterator_4inputs_with_extra_args_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_le_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_cholesky_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_eig_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_inv_ex_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_lu_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_solve_ex_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_solve_triangular_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_svdvals_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_vander_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_log_softmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_log_softmax_with_dtype_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logaddexp2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logical_and_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logspace_tensor_overload_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_masked_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_meshgrid_list_of_tensors_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mvlgamma_mvlgamma_p_3_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_neg_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_ctc_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_embedding_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_feature_alpha_dropout_without_train_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_group_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_hardswish_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_hinge_embedding_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_max_unpool1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_max_unpool2d_grad_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_max_unpool3d_grad_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_mse_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_multi_head_attention_forward_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_pad_replicate_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_pixel_unshuffle_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_softsign_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_outer_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_pinverse_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_prod_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_put_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_ravel_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_reshape_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_scalar_tensor_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_searchsorted_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_select_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_signal_windows_general_hamming_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_sparse_mm_reduce_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_sparse_sampled_addmm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_erfcx_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_i1_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_i1e_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_modified_bessel_k1_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_stft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_sum_to_size_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_svd_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_svd_lowrank_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_tan_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_tensor_split_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_tile_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_true_divide_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_true_divide_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_uniform_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_var_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_var_mean_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_vdot_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_view_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_zeros_like_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD___getitem___cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD___rpow___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD__native_batch_norm_legit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD__softmax_backward_data_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD__upsample_bilinear2d_aa_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_add_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_addcdiv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_addmm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_as_strided_scatter_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_atan_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_atleast_1d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_atleast_1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_cartesian_prod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_chalf_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_cholesky_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_chunk_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_clamp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_constant_pad_nd_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_cos_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_cumsum_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_diff_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_dot_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_dstack_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_dstack_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_equal_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_exp2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_fftn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_irfft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_irfftn_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_flatten_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_float_power_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_float_power_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fmod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_full_like_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_hsplit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_index_add_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_index_reduce_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_inner_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_isclose_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_isinf_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_jiterator_4inputs_with_extra_args_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_jiterator_binary_return_by_ref_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_jiterator_unary_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_kron_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_eigvalsh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_inv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_ldl_factor_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_matrix_norm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_multi_dot_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_pinv_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_solve_ex_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_svdvals_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_vecdot_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linspace_tensor_overload_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_log1p_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logaddexp2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logaddexp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logdet_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logical_or_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logical_xor_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_lt_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_lu_unpack_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_amin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_select_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_var_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_matmul_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_median_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nansum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_ne_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_neg_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_adaptive_max_pool1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_conv1d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_conv1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_cosine_similarity_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_embedding_bag_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_feature_alpha_dropout_without_train_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_fractional_max_pool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_hinge_embedding_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_max_pool1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_multi_head_attention_forward_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pad_circular_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pad_circular_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pad_constant_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pad_replicate_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pairwise_distance_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pixel_shuffle_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_softsign_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_triplet_margin_loss_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_triplet_margin_with_distance_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_unfold_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_norm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_norm_inf_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_normal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_normal_in_place_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_outer_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_rad2deg_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_randn_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_randn_like_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_scatter_reduce_amax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_searchsorted_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_select_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_signal_windows_general_hamming_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_signal_windows_kaiser_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sinh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_softmax_with_dtype_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_laguerre_polynomial_l_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_square_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sub_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sum_to_size_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_triangular_solve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_triu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_unflatten_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_view_as_real_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_view_cuda_float64 2024-01-31T23:46:19.1573288Z 2024-01-31T23:46:19.1573722Z Running test_ops_fwd_gradients 6/10 ... [2024-01-31 23:46:19.125937] 2024-01-31T23:46:19.1575547Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_fwd_gradients.py', '--shard-id=6', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 23:46:19.126126] 2024-01-31T23:57:13.3764195Z 2024-01-31T23:57:13.3766531Z test_ops_fwd_gradients 6/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_fwd_gradients_6.10_8a3cf8042cc3d719_.log 2024-01-31T23:57:13.3921741Z Running 317 items in this shard: test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad___getitem___cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_addcmul_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_addmv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_addr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_all_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_allclose_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_amax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_argwhere_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_as_strided_partial_views_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_as_strided_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_atanh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cartesian_prod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_cholesky_solve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_conj_physical_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_count_nonzero_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_diagonal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_digamma_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_dist_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_dot_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_empty_permuted_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_exp2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_expand_as_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_fft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_hfft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_ihfft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_float_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_float_power_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_floor_divide_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_frac_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_geometric_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_gradient_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_grid_sampler_2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_hsplit_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_i0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_isfinite_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_isreal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_item_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_item_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_jiterator_2inputs_2outputs_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_jiterator_4inputs_with_extra_args_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_jiterator_binary_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_kron_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_cross_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_eigvals_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_ldl_solve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_matrix_norm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_norm_subgradients_at_zero_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_solve_triangular_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_tensorsolve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_vander_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_vander_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_vector_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logdet_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logical_or_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_lu_solve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_argmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_softmin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_mode_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_new_ones_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_adaptive_avg_pool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_alpha_dropout_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_batch_norm_without_cudnn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_conv2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_conv_transpose2d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_hardswish_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_l1_loss_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_layer_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_linear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_max_pool1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_max_unpool1d_grad_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_grad_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_mish_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_pad_reflect_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_pixel_shuffle_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_pixel_unshuffle_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_relu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_silu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_soft_margin_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_softmin_with_dtype_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_triplet_margin_with_distance_loss_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_upsample_bilinear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ones_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_polygamma_polygamma_n_2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_polygamma_polygamma_n_3_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_positive_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_prod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_put_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_remainder_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_resize__cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_resolve_conj_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_rot90_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_scatter_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sgn_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_signal_windows_cosine_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_signal_windows_gaussian_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_signal_windows_general_hamming_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_signal_windows_kaiser_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_softmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sparse_sampled_addmm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_bessel_j0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_chebyshev_polynomial_u_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_laguerre_polynomial_l_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_log_ndtr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_modified_bessel_i0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_polygamma_special_polygamma_n_0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_take_along_dim_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_tanh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_tensordot_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_to_sparse_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unbind_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unfold_copy_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_var_mean_unbiased_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_view_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_where_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_H_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_abs_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_acos_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_acosh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_add_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_all_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_amin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_any_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_asin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atan_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atanh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atleast_1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atleast_3d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cartesian_prod_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cauchy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_clamp_max_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_clone_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_column_stack_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cosh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_count_nonzero_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_diagonal_scatter_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_div_no_rounding_mode_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_dot_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_expand_as_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_hfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_ihfft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_flatten_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fmod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_geometric_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_hsplit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_index_put_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_isfinite_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_istft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_jiterator_binary_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_kron_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_log2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_log_normal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logcumsumexp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logical_not_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logical_xor_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_masked_prod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_masked_sum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_matmul_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_max_pool2d_with_indices_backward_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_max_reduction_no_dim_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mean_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_meshgrid_variadic_tensors_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mv_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nanquantile_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nansum_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_narrow_copy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_new_empty_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_new_full_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_conv2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_conv_transpose1d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_conv_transpose1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_cross_entropy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_embedding_bag_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_feature_alpha_dropout_with_train_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_fractional_max_pool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_interpolate_bicubic_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_margin_ranking_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_max_pool1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_max_pool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_max_unpool2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_pad_constant_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_pdist_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_triplet_margin_with_distance_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nonzero_static_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_norm_fro_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_pca_lowrank_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_qr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_rand_like_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_reciprocal_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_renorm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_repeat_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_roll_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_round_decimals_0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_round_decimals_neg_3_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_scalar_tensor_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_scatter_add_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_select_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_signal_windows_blackman_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_softmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_softmax_with_dtype_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_scaled_modified_bessel_k0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_split_list_args_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_std_mean_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_tanh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_tensordot_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_to_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_trapz_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_tril_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_unbind_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_unflatten_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_unsafe_chunk_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_unsafe_chunk_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_unsqueeze_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_var_mean_unbiased_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_vdot_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_argwhere_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_as_strided_partial_views_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_baddbmm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_bfloat16_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_block_diag_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_constant_pad_nd_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_empty_like_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_empty_strided_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_empty_strided_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_eq_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_equal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_erf_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_exp_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_expand_as_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_exponential_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_hfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_hfftn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_floor_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_frexp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_index_select_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_isfinite_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_isreal_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_item_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_jiterator_2inputs_2outputs_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_jiterator_binary_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_kron_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_kthvalue_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_det_singular_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_eig_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_eigh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_eigvals_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_vector_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_log_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logical_not_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logical_or_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_logit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_mT_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_normalize_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_std_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_sum_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_max_binary_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_meshgrid_list_of_tensors_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_mv_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_narrow_copy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_narrow_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_new_empty_strided_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_new_empty_strided_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_binary_cross_entropy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_conv2d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_conv_transpose1d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_conv_transpose3d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_cosine_embedding_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_hardshrink_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_hardswish_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_local_response_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_max_pool3d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pixel_unshuffle_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_poisson_nll_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_silu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_soft_margin_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_softmin_with_dtype_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_triplet_margin_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_triplet_margin_with_distance_loss_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_upsample_nearest_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_norm_fro_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_norm_nuc_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_ones_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_ormqr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_pow_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_put_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_real_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_rsqrt_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_select_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_select_scatter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_signal_windows_cosine_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_signbit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sinh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sparse_mm_reduce_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sparse_sampled_addmm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_bessel_j1_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_bessel_y0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_modified_bessel_i0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_shifted_chebyshev_polynomial_v_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_split_with_sizes_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_split_with_sizes_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_stack_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_std_mean_unbiased_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sum_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_take_along_dim_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_tanh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_tile_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_trace_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_true_divide_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_zeros_like_cuda_float64 2024-01-31T23:57:13.4071387Z 2024-01-31T23:57:13.4071822Z Running test_ops_fwd_gradients 10/10 ... [2024-01-31 23:57:13.377285] 2024-01-31T23:57:13.4073642Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_fwd_gradients.py', '--shard-id=10', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-01-31 23:57:13.377492] 2024-02-01T00:06:42.8926922Z 2024-02-01T00:06:42.8928543Z test_ops_fwd_gradients 10/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_fwd_gradients_10.10_b6b87c8eb18e0c60_.log 2024-02-01T00:06:42.9059201Z Running 276 items in this shard: test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad___rpow___cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad___rsub___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad__native_batch_norm_legit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad__softmax_backward_data_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_acos_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_addbmm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_baddbmm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_block_diag_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_bmm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_broadcast_tensors_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_bucketize_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_chunk_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_constant_pad_nd_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_contiguous_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_corrcoef_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_diagflat_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_div_trunc_rounding_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_einsum_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_empty_strided_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_eq_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_equal_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_expand_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_fftn_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_ifft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_ihfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fft_irfftn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_flatten_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_flip_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_fliplr_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ge_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_gradient_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_index_add_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_cross_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_det_singular_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_diagonal_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_eigvalsh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_eigvalsh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_matrix_rank_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_norm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_linalg_pinv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logical_or_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logical_xor_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_logit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_amin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_masked_normalize_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_mm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_mul_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nanmean_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_narrow_copy_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_narrow_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_native_layer_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_ne_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_neg_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_new_full_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_new_ones_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_grid_sample_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_group_norm_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_pad_replicate_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_nn_functional_softplus_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_norm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_norm_fro_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_pow_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_real_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_reciprocal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_repeat_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_repeat_interleave_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_round_decimals_neg_3_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_rsub_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_scatter_reduce_amax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sign_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_slice_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_shifted_chebyshev_polynomial_u_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_special_spherical_bessel_j0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_squeeze_multiple_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_std_mean_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sub_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_sum_to_size_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_svd_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_tensor_split_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_tensor_split_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_transpose_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_transpose_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_triangular_solve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unflatten_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unfold_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_uniform_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unique_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_unsqueeze_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_var_mean_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_fn_fwgrad_bwgrad_view_copy_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD___getitem___cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD___getitem___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD___rdiv___cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD___rmul___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD__native_batch_norm_legit_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_addmv_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_addmv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_allclose_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_atleast_3d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_bmm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_broadcast_to_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_chunk_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_clamp_min_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_conj_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_contiguous_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_corrcoef_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_count_nonzero_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cross_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cumprod_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cumprod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_deg2rad_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_diag_embed_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_diff_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_div_no_rounding_mode_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_empty_permuted_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_empty_permuted_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_fft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_ifft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_ifftshift_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_ihfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fft_irfft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_fill_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_full_like_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_ge_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_geqrf_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_geqrf_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_igamma_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_index_put_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_isin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_jiterator_unary_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_diagonal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_eigh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_lstsq_grad_oriented_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_matrix_norm_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_log_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_log_softmax_with_dtype_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logdet_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_logsumexp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_long_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_lt_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mH_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_masked_cumsum_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_masked_normalize_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_masked_prod_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_meshgrid_variadic_tensors_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_movedim_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_mul_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nanmean_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_narrow_copy_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_adaptive_avg_pool1d_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_conv2d_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_gelu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_kl_div_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_l1_loss_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_leaky_relu_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_logsigmoid_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_normalize_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_pairwise_distance_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_relu6_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_softmin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_softsign_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_threshold_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_nn_functional_upsample_nearest_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_norm_nuc_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_permute_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_polygamma_polygamma_n_0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_polygamma_polygamma_n_3_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_positive_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_pow_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_resize_as__cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_sgn_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_bessel_j0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_polygamma_special_polygamma_n_0_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_special_zeta_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_square_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_stack_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_sub_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_trapezoid_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_unique_consecutive_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_view_as_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD___radd___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD___rmul___cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD__segment_reduce_offsets_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_addmv_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_addmv_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_allclose_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_amin_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_angle_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_any_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_as_strided_partial_views_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_asin_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_atanh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_broadcast_to_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_cholesky_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_clamp_max_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_clone_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_column_stack_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_column_stack_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_cov_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_diagflat_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_diagonal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_fft2_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_fft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_fftshift_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_hfft2_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_hfft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_ifft_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fft_ihfft2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_fliplr_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_gradient_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_index_add_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_inner_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_int_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_isreal_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_istft_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_cholesky_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_det_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_det_singular_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_eigh_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_inv_ex_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_pinv_singular_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_qr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linalg_solve_ex_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_linspace_tensor_overload_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_log1p_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_long_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_lu_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_argmax_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_cumprod_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_logaddexp_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_prod_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_masked_scatter_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_meshgrid_list_of_tensors_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_meshgrid_variadic_tensors_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_mode_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_multinomial_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_narrow_copy_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_new_zeros_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nextafter_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_bilinear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_feature_alpha_dropout_with_train_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_hardtanh_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_linear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_max_unpool3d_grad_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_mish_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_pad_replicate_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_nn_functional_upsample_bilinear_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_norm_nuc_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_ones_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_ormqr_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_polygamma_polygamma_n_2_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_rand_like_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_repeat_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_resize_as__cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_round_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_sigmoid_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_signal_windows_hann_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_i0e_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_i1e_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_modified_bessel_k1_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_ndtr_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_special_xlog1py_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_squeeze_multiple_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_std_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_take_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_tensor_split_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_trapezoid_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_tril_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_triu_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_uniform_cuda_complex128, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_unique_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_var_mean_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_inplace_forward_mode_AD_where_cuda_complex128 2024-02-01T00:06:42.9186786Z 2024-02-01T00:06:42.9187223Z Running test_cuda_multigpu 1/1 ... [2024-02-01 00:06:42.893372] 2024-02-01T00:06:42.9188897Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_multigpu.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:06:42.893566] 2024-02-01T00:06:45.8626000Z 2024-02-01T00:06:45.8627231Z test_cuda_multigpu 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_multigpu_1.1_bfbba6a84a5d5188_.log 2024-02-01T00:06:45.8648577Z Running 62 items in this shard: test/test_cuda_multigpu.py::TestCudaMultiGPU::test_autogpu, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_caching_pinned_memory_multi_gpu, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_cat_autogpu, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_copy_device, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_copy_streams, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_device_memory_allocated, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_init_race, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_memory_leak_detection, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_set_device, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_synchronize, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_current_stream, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_default_stream, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_events_multi_gpu_elapsed_time, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_events_multi_gpu_query, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_events_wait, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_external_streams, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_external_streams_multi_device, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_get_set_rng_state_all, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_grad_scaling_device_as_key, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_grad_scaling_multigpu, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_grad_scaling_scale, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_grad_scaling_unscale, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_load_nonexistent_device, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_mem_get_info, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_memory_stats, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_memory_stats_multigpu, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_multigpu_serialization_remap, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_multigpu_serialization_remap_dict, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_multigpu_storage_clone, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_new, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_rng_state_offset, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_stream_context, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_stream_event_device, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_stream_event_nogil, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_streaming_backwards_device_transfer, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_streams_multi_gpu, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_streams_multi_gpu_eq, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_streams_multi_gpu_query, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_streams_priority, test/test_cuda_multigpu.py::TestCudaMultiGPU::test_tensor_device, test/test_cuda_multigpu.py::TestCudaComm::test_broadcast_coalesced, test/test_cuda_multigpu.py::TestCudaComm::test_broadcast_coalesced_dense_only, test/test_cuda_multigpu.py::TestCudaComm::test_broadcast_coalesced_empty_tensors, test/test_cuda_multigpu.py::TestCudaComm::test_broadcast_cpu, test/test_cuda_multigpu.py::TestCudaComm::test_broadcast_gpu, test/test_cuda_multigpu.py::TestCudaComm::test_gather, test/test_cuda_multigpu.py::TestCudaComm::test_gather_dim, test/test_cuda_multigpu.py::TestCudaComm::test_gather_namedtuple, test/test_cuda_multigpu.py::TestCudaComm::test_gather_neg_dim, test/test_cuda_multigpu.py::TestCudaComm::test_memory_format_scatter_gather, test/test_cuda_multigpu.py::TestCudaComm::test_reduce_add, test/test_cuda_multigpu.py::TestCudaComm::test_reduce_add_coalesced, test/test_cuda_multigpu.py::TestCudaComm::test_reduce_add_coalesced_dense_only, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_cpu, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_cpu_dim, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_cpu_neg_dim, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_cpu_sizes, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_gpu, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_gpu_dim, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_gpu_neg_dim, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_gpu_sizes, test/test_cuda_multigpu.py::TestCudaComm::test_scatter_namedtuple 2024-02-01T00:06:45.8668288Z 2024-02-01T00:06:45.8668583Z Running test_nn 1/1 ... [2024-02-01 00:06:45.862726] 2024-02-01T00:06:45.8670162Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_nn.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:06:45.862926] 2024-02-01T00:15:51.6974824Z 2024-02-01T00:15:51.6976004Z test_nn 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_nn_1.1_94ac28c60a1ac4cb_.log 2024-02-01T00:15:51.8466730Z Running 2327 items in this shard: test/test_nn.py::TestNN::test_AdaptiveLogSoftmax, test/test_nn.py::TestNN::test_AdaptiveLogSoftmax_cuda_fp32, test/test_nn.py::TestNN::test_AdaptiveLogSoftmax_cuda_tf32, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_none, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_BCELoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_BCELoss_no_reduce, test/test_nn.py::TestNN::test_BCELoss_no_reduce_cuda, test/test_nn.py::TestNN::test_BCELoss_no_reduce_scalar, test/test_nn.py::TestNN::test_BCELoss_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce, test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_cuda, test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_scalar, test/test_nn.py::TestNN::test_BCELoss_weights_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_legacy_enum, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_legacy_enum_cuda, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_reduce, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_reduce_scalar, test/test_nn.py::TestNN::test_BCEWithLogitsLoss_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_CELU_no_batch_dim, test/test_nn.py::TestNN::test_CELU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_CTCLoss_critical_target_len, test/test_nn.py::TestNN::test_CTCLoss_lengthchecks_cpu, test/test_nn.py::TestNN::test_CTCLoss_lengthchecks_cuda, test/test_nn.py::TestNN::test_CTCLoss_long_targets, test/test_nn.py::TestNN::test_CTCLoss_typechecks, test/test_nn.py::TestNN::test_CTCLoss_zero_infinity, test/test_nn.py::TestNN::test_Conv1d, test/test_nn.py::TestNN::test_Conv1d_circular_stride2_pad2, test/test_nn.py::TestNN::test_Conv1d_circular_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_circular_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_dilated, test/test_nn.py::TestNN::test_Conv1d_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_groups, test/test_nn.py::TestNN::test_Conv1d_groups_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_groups_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad1, test/test_nn.py::TestNN::test_Conv1d_pad1_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad1_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad1size1, test/test_nn.py::TestNN::test_Conv1d_pad1size1_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad1size1_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad2, test/test_nn.py::TestNN::test_Conv1d_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad2size1, test/test_nn.py::TestNN::test_Conv1d_pad2size1_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad2size1_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad_same, test/test_nn.py::TestNN::test_Conv1d_pad_same2, test/test_nn.py::TestNN::test_Conv1d_pad_same2_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad_same2_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad_same_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad_same_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad_same_dilated, test/test_nn.py::TestNN::test_Conv1d_pad_same_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad_same_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_pad_valid, test/test_nn.py::TestNN::test_Conv1d_pad_valid_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_pad_valid_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_reflect_stride2_pad2, test/test_nn.py::TestNN::test_Conv1d_reflect_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_reflect_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_replicate_stride2_pad2, test/test_nn.py::TestNN::test_Conv1d_replicate_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_replicate_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_stride, test/test_nn.py::TestNN::test_Conv1d_stride_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_stride_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_zero_batch, test/test_nn.py::TestNN::test_Conv1d_zero_batch_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_zero_batch_cuda_tf32, test/test_nn.py::TestNN::test_Conv1d_zeros_stride2_pad2, test/test_nn.py::TestNN::test_Conv1d_zeros_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv1d_zeros_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d, test/test_nn.py::TestNN::test_Conv2d_circular_stride2_pad2, test/test_nn.py::TestNN::test_Conv2d_circular_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_circular_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_depthwise, test/test_nn.py::TestNN::test_Conv2d_depthwise_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_depthwise_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_depthwise_dilated, test/test_nn.py::TestNN::test_Conv2d_depthwise_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_depthwise_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_depthwise_padded, test/test_nn.py::TestNN::test_Conv2d_depthwise_padded_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_depthwise_padded_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_depthwise_strided, test/test_nn.py::TestNN::test_Conv2d_depthwise_strided_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_depthwise_strided_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_depthwise_with_multiplier, test/test_nn.py::TestNN::test_Conv2d_depthwise_with_multiplier_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_depthwise_with_multiplier_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_dilated, test/test_nn.py::TestNN::test_Conv2d_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_dilated_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_dilated_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_dilated_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_groups, test/test_nn.py::TestNN::test_Conv2d_groups_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_groups_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_groups_thnn, test/test_nn.py::TestNN::test_Conv2d_groups_thnn_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_groups_thnn_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_groups_thnn_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_groups_thnn_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_groups_thnn_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_groups_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_groups_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_groups_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_no_bias, test/test_nn.py::TestNN::test_Conv2d_no_bias_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_no_bias_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_no_bias_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_no_bias_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_no_bias_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_pad_same, test/test_nn.py::TestNN::test_Conv2d_pad_same_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_pad_same_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_pad_same_dilated, test/test_nn.py::TestNN::test_Conv2d_pad_same_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_pad_same_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_pad_valid, test/test_nn.py::TestNN::test_Conv2d_pad_valid_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_pad_valid_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_padding, test/test_nn.py::TestNN::test_Conv2d_padding_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_padding_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_padding_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_padding_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_padding_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_reflect_stride2_pad2, test/test_nn.py::TestNN::test_Conv2d_reflect_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_reflect_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_replicate_stride2_pad2, test/test_nn.py::TestNN::test_Conv2d_replicate_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_replicate_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_strided, test/test_nn.py::TestNN::test_Conv2d_strided_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_strided_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_strided_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_strided_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_strided_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_zero_batch, test/test_nn.py::TestNN::test_Conv2d_zero_batch_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_zero_batch_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_zero_batch_with_long_tensor, test/test_nn.py::TestNN::test_Conv2d_zero_batch_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_zero_batch_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv2d_zeros_stride2_pad2, test/test_nn.py::TestNN::test_Conv2d_zeros_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv2d_zeros_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d, test/test_nn.py::TestNN::test_Conv3d_1x1x1_no_bias, test/test_nn.py::TestNN::test_Conv3d_1x1x1_no_bias_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_1x1x1_no_bias_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_1x1x1_no_bias_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_1x1x1_no_bias_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_1x1x1_no_bias_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_circular_stride2_pad2, test/test_nn.py::TestNN::test_Conv3d_circular_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_circular_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_dilated, test/test_nn.py::TestNN::test_Conv3d_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_dilated_strided, test/test_nn.py::TestNN::test_Conv3d_dilated_strided_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_dilated_strided_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_groups, test/test_nn.py::TestNN::test_Conv3d_groups_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_groups_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_groups_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_groups_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_groups_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_no_bias, test/test_nn.py::TestNN::test_Conv3d_no_bias_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_no_bias_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_no_bias_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_no_bias_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_no_bias_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_pad_same, test/test_nn.py::TestNN::test_Conv3d_pad_same_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_pad_same_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_pad_same_dilated, test/test_nn.py::TestNN::test_Conv3d_pad_same_dilated_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_pad_same_dilated_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_pad_valid, test/test_nn.py::TestNN::test_Conv3d_pad_valid_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_pad_valid_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_replicate_stride2_pad2, test/test_nn.py::TestNN::test_Conv3d_replicate_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_replicate_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_stride, test/test_nn.py::TestNN::test_Conv3d_stride_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_stride_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_stride_padding, test/test_nn.py::TestNN::test_Conv3d_stride_padding_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_stride_padding_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_stride_padding_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_stride_padding_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_stride_padding_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_stride_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_stride_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_stride_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_zero_batch, test/test_nn.py::TestNN::test_Conv3d_zero_batch_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_zero_batch_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_zero_batch_with_long_tensor, test/test_nn.py::TestNN::test_Conv3d_zero_batch_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_zero_batch_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_Conv3d_zeros_stride2_pad2, test/test_nn.py::TestNN::test_Conv3d_zeros_stride2_pad2_cuda_fp32, test/test_nn.py::TestNN::test_Conv3d_zeros_stride2_pad2_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose1d, test/test_nn.py::TestNN::test_ConvTranspose1d_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose1d_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose1d_dilated, test/test_nn.py::TestNN::test_ConvTranspose1d_dilated_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose1d_dilated_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose1d_groups, test/test_nn.py::TestNN::test_ConvTranspose1d_groups_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose1d_groups_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose1d_no_bias, test/test_nn.py::TestNN::test_ConvTranspose1d_no_bias_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose1d_no_bias_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d, test/test_nn.py::TestNN::test_ConvTranspose2d_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_dilated, test/test_nn.py::TestNN::test_ConvTranspose2d_dilated_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_dilated_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_dilated_with_long_tensor, test/test_nn.py::TestNN::test_ConvTranspose2d_dilated_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_dilated_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_groups, test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_groups_with_long_tensor, test/test_nn.py::TestNN::test_ConvTranspose2d_groups_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_groups_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_no_bias, test/test_nn.py::TestNN::test_ConvTranspose2d_no_bias_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_no_bias_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_no_bias_with_long_tensor, test/test_nn.py::TestNN::test_ConvTranspose2d_no_bias_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_no_bias_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose2d_with_long_tensor, test/test_nn.py::TestNN::test_ConvTranspose2d_with_long_tensor_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose2d_with_long_tensor_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose3d, test/test_nn.py::TestNN::test_ConvTranspose3d_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose3d_cuda_tf32, test/test_nn.py::TestNN::test_ConvTranspose3d_dilated, test/test_nn.py::TestNN::test_ConvTranspose3d_dilated_cuda_fp32, test/test_nn.py::TestNN::test_ConvTranspose3d_dilated_cuda_tf32, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_CosineEmbeddingLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_CrossMapLRN2d, test/test_nn.py::TestNN::test_CrossMapLRN2d_cuda, test/test_nn.py::TestNN::test_ELU_no_batch_dim, test/test_nn.py::TestNN::test_ELU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Embedding, test/test_nn.py::TestNN::test_EmbeddingBag_discontiguous, test/test_nn.py::TestNN::test_EmbeddingBag_discontiguous_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_max, test/test_nn.py::TestNN::test_EmbeddingBag_max_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_max_padding_idx, test/test_nn.py::TestNN::test_EmbeddingBag_max_padding_idx_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_mean, test/test_nn.py::TestNN::test_EmbeddingBag_mean_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_mean_padding_idx, test/test_nn.py::TestNN::test_EmbeddingBag_mean_padding_idx_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_sparse, test/test_nn.py::TestNN::test_EmbeddingBag_sparse_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_sum, test/test_nn.py::TestNN::test_EmbeddingBag_sum_cuda, test/test_nn.py::TestNN::test_EmbeddingBag_sum_padding_idx, test/test_nn.py::TestNN::test_EmbeddingBag_sum_padding_idx_cuda, test/test_nn.py::TestNN::test_Embedding_cuda, test/test_nn.py::TestNN::test_Embedding_discontiguous, test/test_nn.py::TestNN::test_Embedding_discontiguous_cuda, test/test_nn.py::TestNN::test_Embedding_sparse, test/test_nn.py::TestNN::test_Embedding_sparse_cuda, test/test_nn.py::TestNN::test_Flatten, test/test_nn.py::TestNN::test_Flatten_cuda, test/test_nn.py::TestNN::test_Flatten_no_batch_dim, test/test_nn.py::TestNN::test_Flatten_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Fold, test/test_nn.py::TestNN::test_Fold_cuda, test/test_nn.py::TestNN::test_Fold_int_input, test/test_nn.py::TestNN::test_Fold_int_input_cuda, test/test_nn.py::TestNN::test_Fold_no_batch_dim_input, test/test_nn.py::TestNN::test_Fold_no_batch_dim_input_cuda, test/test_nn.py::TestNN::test_Fold_no_batch_dim_int_input, test/test_nn.py::TestNN::test_Fold_no_batch_dim_int_input_cuda, test/test_nn.py::TestNN::test_GELU_no_batch_dim, test/test_nn.py::TestNN::test_GELU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_GLU_no_batch_dim, test/test_nn.py::TestNN::test_GLU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Hardshrink_no_batch_dim, test/test_nn.py::TestNN::test_Hardshrink_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Hardsigmoid_no_batch_dim, test/test_nn.py::TestNN::test_Hardsigmoid_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Hardswish_no_batch_dim, test/test_nn.py::TestNN::test_Hardswish_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Hardtanh_no_batch_dim, test/test_nn.py::TestNN::test_Hardtanh_no_batch_dim_cuda, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_margin_no_reduce, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_margin_no_reduce_cuda, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_reduce, test/test_nn.py::TestNN::test_HingeEmbeddingLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_HuberLoss_delta, test/test_nn.py::TestNN::test_HuberLoss_delta_cuda, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_HuberLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_KLDivLoss_batch_mean, test/test_nn.py::TestNN::test_KLDivLoss_batch_mean_log_target, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_KLDivLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_log_target, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_log_target_cuda, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_scalar, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_scalar_log_target, test/test_nn.py::TestNN::test_KLDivLoss_no_reduce_scalar_log_target_cuda, test/test_nn.py::TestNN::test_KLDivLoss_with_log_target_no_reduce, test/test_nn.py::TestNN::test_KLDivLoss_with_log_target_no_reduce_cuda, test/test_nn.py::TestNN::test_KLDivLoss_with_target_no_reduce, test/test_nn.py::TestNN::test_KLDivLoss_with_target_no_reduce_cuda, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_mean, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_none, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_sum, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_L1Loss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_L1Loss_no_reduce, test/test_nn.py::TestNN::test_L1Loss_no_reduce_complex, test/test_nn.py::TestNN::test_L1Loss_no_reduce_complex_cuda, test/test_nn.py::TestNN::test_L1Loss_no_reduce_cuda, test/test_nn.py::TestNN::test_L1Loss_no_reduce_scalar, test/test_nn.py::TestNN::test_L1Loss_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_LSTM_cell, test/test_nn.py::TestNN::test_LSTM_cell_forward_hidden_size, test/test_nn.py::TestNN::test_LSTM_cell_forward_input_size, test/test_nn.py::TestNN::test_LayerNorm_3d_no_affine_large_feature, test/test_nn.py::TestNN::test_LayerNorm_3d_no_affine_large_feature_cuda, test/test_nn.py::TestNN::test_LayerNorm_3d_no_affine_large_feature_eval, test/test_nn.py::TestNN::test_LayerNorm_3d_no_affine_large_feature_eval_cuda, test/test_nn.py::TestNN::test_LeakyReLU_no_batch_dim, test/test_nn.py::TestNN::test_LeakyReLU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Linear, test/test_nn.py::TestNN::test_Linear_cuda_fp32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_no_batch_dim, test/test_nn.py::TestNN::test_Linear_no_batch_dim_cuda_fp32, test/test_nn.py::TestNN::test_Linear_no_batch_dim_cuda_tf32, test/test_nn.py::TestNN::test_Linear_no_bias, test/test_nn.py::TestNN::test_Linear_no_bias_cuda_fp32, test/test_nn.py::TestNN::test_Linear_no_bias_cuda_tf32, test/test_nn.py::TestNN::test_LogSigmoid_no_batch_dim, test/test_nn.py::TestNN::test_LogSigmoid_no_batch_dim_cuda, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_none, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_MSELoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_MSELoss_no_reduce, test/test_nn.py::TestNN::test_MSELoss_no_reduce_cuda, test/test_nn.py::TestNN::test_MSELoss_no_reduce_scalar, test/test_nn.py::TestNN::test_MSELoss_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_MarginRankingLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_MaxUnpool1d_net, test/test_nn.py::TestNN::test_MaxUnpool1d_net_cuda, test/test_nn.py::TestNN::test_MaxUnpool1d_net_no_batch_dim, test/test_nn.py::TestNN::test_MaxUnpool1d_net_no_batch_dim_cuda, test/test_nn.py::TestNN::test_MaxUnpool2d_net, test/test_nn.py::TestNN::test_MaxUnpool2d_net_cuda, test/test_nn.py::TestNN::test_MaxUnpool2d_net_no_batch_dim, test/test_nn.py::TestNN::test_MaxUnpool2d_net_no_batch_dim_cuda, test/test_nn.py::TestNN::test_MaxUnpool3d_net, test/test_nn.py::TestNN::test_MaxUnpool3d_net_cuda, test/test_nn.py::TestNN::test_MaxUnpool3d_net_no_batch_dim, test/test_nn.py::TestNN::test_MaxUnpool3d_net_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Mish_no_batch_dim, test/test_nn.py::TestNN::test_Mish_no_batch_dim_cuda, test/test_nn.py::TestNN::test_ModuleDict, test/test_nn.py::TestNN::test_ModuleList, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_0d_no_reduce, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_0d_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_1d_no_reduce, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_1d_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_index_neg, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_index_neg_cuda, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_reduce, test/test_nn.py::TestNN::test_MultiLabelMarginLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_reduce, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_weights_no_reduce, test/test_nn.py::TestNN::test_MultiLabelSoftMarginLoss_weights_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiMarginLoss_1d_no_reduce, test/test_nn.py::TestNN::test_MultiMarginLoss_1d_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiMarginLoss_margin_no_reduce, test/test_nn.py::TestNN::test_MultiMarginLoss_margin_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiMarginLoss_no_reduce, test/test_nn.py::TestNN::test_MultiMarginLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiMarginLoss_p_no_reduce, test/test_nn.py::TestNN::test_MultiMarginLoss_p_no_reduce_cuda, test/test_nn.py::TestNN::test_MultiMarginLoss_weights_no_reduce, test/test_nn.py::TestNN::test_MultiMarginLoss_weights_no_reduce_cuda, test/test_nn.py::TestNN::test_NLLLoss2d_no_reduce, test/test_nn.py::TestNN::test_NLLLoss2d_no_reduce_cuda, test/test_nn.py::TestNN::test_NLLLoss2d_no_reduce_ignore_index, test/test_nn.py::TestNN::test_NLLLoss2d_no_reduce_ignore_index_cuda, test/test_nn.py::TestNN::test_NLLLoss2d_no_reduce_weights, test/test_nn.py::TestNN::test_NLLLoss2d_no_reduce_weights_cuda, test/test_nn.py::TestNN::test_NLLLossNd_no_reduce, test/test_nn.py::TestNN::test_NLLLossNd_no_reduce_cuda, test/test_nn.py::TestNN::test_NLLLossNd_no_reduce_ignore_index, test/test_nn.py::TestNN::test_NLLLossNd_no_reduce_ignore_index_cuda, test/test_nn.py::TestNN::test_NLLLossNd_no_reduce_weights, test/test_nn.py::TestNN::test_NLLLossNd_no_reduce_weights_cuda, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_NLLLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_NLLLoss_no_reduce, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_ignore_index, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_ignore_index_cuda, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_weights, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_weights_cuda, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_weights_ignore_index, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_weights_ignore_index_cuda, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_weights_ignore_index_neg, test/test_nn.py::TestNN::test_NLLLoss_no_reduce_weights_ignore_index_neg_cuda, test/test_nn.py::TestNN::test_PReLU_backward_requires_grad_false, test/test_nn.py::TestNN::test_PReLU_no_batch_dim, test/test_nn.py::TestNN::test_PReLU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_PairwiseDistance, test/test_nn.py::TestNN::test_PairwiseDistance_broadcast_lhs, test/test_nn.py::TestNN::test_PairwiseDistance_broadcast_lhs_cuda, test/test_nn.py::TestNN::test_PairwiseDistance_broadcast_rhs, test/test_nn.py::TestNN::test_PairwiseDistance_broadcast_rhs_cuda, test/test_nn.py::TestNN::test_PairwiseDistance_cuda, test/test_nn.py::TestNN::test_PairwiseDistance_no_batch_dim, test/test_nn.py::TestNN::test_PairwiseDistance_no_batch_dim_cuda, test/test_nn.py::TestNN::test_PairwiseDistance_with_non_default_args, test/test_nn.py::TestNN::test_PairwiseDistance_with_non_default_args_cuda, test/test_nn.py::TestNN::test_ParameterDict, test/test_nn.py::TestNN::test_ParameterDict_replication, test/test_nn.py::TestNN::test_ParameterList, test/test_nn.py::TestNN::test_ParameterList_meta, test/test_nn.py::TestNN::test_ParameterList_replication, test/test_nn.py::TestNN::test_PixelShuffle, test/test_nn.py::TestNN::test_PixelShuffle_cuda, test/test_nn.py::TestNN::test_PixelUnshuffle, test/test_nn.py::TestNN::test_PixelUnshuffle_cuda, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_reduce, test/test_nn.py::TestNN::test_PoissonNLLLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_RNN_cell, test/test_nn.py::TestNN::test_RNN_cell_forward_zero_hidden_size, test/test_nn.py::TestNN::test_RNN_cell_no_broadcasting, test/test_nn.py::TestNN::test_RNN_change_dropout, test/test_nn.py::TestNN::test_RNN_cpu_vs_cudnn_no_dropout, test/test_nn.py::TestNN::test_RNN_cpu_vs_cudnn_with_dropout, test/test_nn.py::TestNN::test_RNN_cudnn_weight_norm, test/test_nn.py::TestNN::test_RNN_dropout, test/test_nn.py::TestNN::test_RNN_dropout_state, test/test_nn.py::TestNN::test_RNN_input_size_zero, test/test_nn.py::TestNN::test_RNN_nonlinearity, test/test_nn.py::TestNN::test_RReLU, test/test_nn.py::TestNN::test_RReLU_cuda, test/test_nn.py::TestNN::test_RReLU_no_batch_dim, test/test_nn.py::TestNN::test_RReLU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_RReLU_with_up_down, test/test_nn.py::TestNN::test_RReLU_with_up_down_cuda, test/test_nn.py::TestNN::test_RReLU_with_up_down_scalar, test/test_nn.py::TestNN::test_RReLU_with_up_down_scalar_cuda, test/test_nn.py::TestNN::test_ReLU6_no_batch_dim, test/test_nn.py::TestNN::test_ReLU6_no_batch_dim_cuda, test/test_nn.py::TestNN::test_ReLU_no_batch_dim, test/test_nn.py::TestNN::test_ReLU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_ReplicationPad3d, test/test_nn.py::TestNN::test_ReplicationPad3d_complex, test/test_nn.py::TestNN::test_ReplicationPad3d_complex_cuda, test/test_nn.py::TestNN::test_ReplicationPad3d_cuda, test/test_nn.py::TestNN::test_ReplicationPad3d_no_batch_dim, test/test_nn.py::TestNN::test_ReplicationPad3d_no_batch_dim_cuda, test/test_nn.py::TestNN::test_SELU_no_batch_dim, test/test_nn.py::TestNN::test_SELU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Sequential_add, test/test_nn.py::TestNN::test_Sequential_append, test/test_nn.py::TestNN::test_Sequential_delitem, test/test_nn.py::TestNN::test_Sequential_extend, test/test_nn.py::TestNN::test_Sequential_getitem, test/test_nn.py::TestNN::test_Sequential_iadd, test/test_nn.py::TestNN::test_Sequential_imul, test/test_nn.py::TestNN::test_Sequential_insert, test/test_nn.py::TestNN::test_Sequential_insert_fail_case, test/test_nn.py::TestNN::test_Sequential_mul, test/test_nn.py::TestNN::test_Sequential_pop, test/test_nn.py::TestNN::test_Sequential_rmul, test/test_nn.py::TestNN::test_Sequential_setitem, test/test_nn.py::TestNN::test_Sequential_setitem_named, test/test_nn.py::TestNN::test_SiLU_no_batch_dim, test/test_nn.py::TestNN::test_SiLU_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Sigmoid_no_batch_dim, test/test_nn.py::TestNN::test_Sigmoid_no_batch_dim_cuda, test/test_nn.py::TestNN::test_SmoothL1Loss_beta, test/test_nn.py::TestNN::test_SmoothL1Loss_beta_cuda, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_mean, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_none, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_sum, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_SmoothL1Loss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_SmoothL1Loss_no_reduce, test/test_nn.py::TestNN::test_SmoothL1Loss_no_reduce_cuda, test/test_nn.py::TestNN::test_SmoothL1Loss_no_reduce_scalar, test/test_nn.py::TestNN::test_SmoothL1Loss_no_reduce_scalar_cuda, test/test_nn.py::TestNN::test_SmoothL1Loss_zero_beta, test/test_nn.py::TestNN::test_SmoothL1Loss_zero_beta_cuda, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_SoftMarginLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_SoftMarginLoss_no_reduce, test/test_nn.py::TestNN::test_SoftMarginLoss_no_reduce_cuda, test/test_nn.py::TestNN::test_Softplus_no_batch_dim, test/test_nn.py::TestNN::test_Softplus_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Softshrink_no_batch_dim, test/test_nn.py::TestNN::test_Softshrink_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Softsign_no_batch_dim, test/test_nn.py::TestNN::test_Softsign_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Tanh_no_batch_dim, test/test_nn.py::TestNN::test_Tanh_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Tanhshrink_no_batch_dim, test/test_nn.py::TestNN::test_Tanhshrink_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Threshold_no_batch_dim, test/test_nn.py::TestNN::test_Threshold_no_batch_dim_cuda, test/test_nn.py::TestNN::test_TransformerDecoderLayer_gelu_activation, test/test_nn.py::TestNN::test_TransformerDecoderLayer_gelu_activation_cuda_fp32, test/test_nn.py::TestNN::test_TransformerDecoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerDecoderLayer_relu_activation, test/test_nn.py::TestNN::test_TransformerDecoderLayer_relu_activation_cuda_fp32, test/test_nn.py::TestNN::test_TransformerDecoderLayer_relu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_fp32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_relu_activation, test/test_nn.py::TestNN::test_TransformerEncoderLayer_relu_activation_cuda_fp32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_relu_activation_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_cell, test/test_nn.py::TestNN::test_Transformer_multilayer_coder, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_fp32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_mean, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_mean_cuda_double, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_mean_cuda_fp32, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_mean_cuda_half, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_mean_cuda_tf32, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_none, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_none_cuda_double, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_none_cuda_fp32, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_none_cuda_half, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_none_cuda_tf32, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_sum, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_sum_cuda_double, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_sum_cuda_fp32, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_sum_cuda_half, test/test_nn.py::TestNN::test_TripletMarginLoss_no_batch_dim_sum_cuda_tf32, test/test_nn.py::TestNN::test_Unflatten_no_batch_dim, test/test_nn.py::TestNN::test_Unflatten_no_batch_dim_cuda, test/test_nn.py::TestNN::test_Unfold, test/test_nn.py::TestNN::test_Unfold_cuda, test/test_nn.py::TestNN::test_Unfold_int_input, test/test_nn.py::TestNN::test_Unfold_int_input_cuda, test/test_nn.py::TestNN::test_adaptive_log_softmax, test/test_nn.py::TestNN::test_add_module, test/test_nn.py::TestNN::test_add_module_raises_error_if_attr_exists, test/test_nn.py::TestNN::test_affine_grid, test/test_nn.py::TestNN::test_affine_grid_3d, test/test_nn.py::TestNN::test_affine_grid_error_checking, test/test_nn.py::TestNN::test_assignment, test/test_nn.py::TestNN::test_batchnorm_buffer_update_when_stats_are_not_tracked, test/test_nn.py::TestNN::test_batchnorm_cudnn_half, test/test_nn.py::TestNN::test_batchnorm_cudnn_nhwc, test/test_nn.py::TestNN::test_batchnorm_load_state_dict, test/test_nn.py::TestNN::test_batchnorm_nhwc_cpu, test/test_nn.py::TestNN::test_batchnorm_nhwc_cuda, test/test_nn.py::TestNN::test_batchnorm_non_contig_cpu_BatchNorm2d, test/test_nn.py::TestNN::test_batchnorm_non_contig_cpu_SyncBatchNorm, test/test_nn.py::TestNN::test_batchnorm_nonaffine_cuda_half_input, test/test_nn.py::TestNN::test_batchnorm_raises_error_if_bias_is_not_same_size_as_input, test/test_nn.py::TestNN::test_batchnorm_raises_error_if_less_than_one_value_per_channel, test/test_nn.py::TestNN::test_batchnorm_raises_error_if_running_mean_is_not_same_size_as_input, test/test_nn.py::TestNN::test_batchnorm_raises_error_if_running_var_is_not_same_size_as_input, test/test_nn.py::TestNN::test_batchnorm_raises_error_if_running_var_or_running_mean_have_forward_grad, test/test_nn.py::TestNN::test_batchnorm_raises_error_if_weight_is_not_same_size_as_input, test/test_nn.py::TestNN::test_bce_loss_always_nonnegative, test/test_nn.py::TestNN::test_bce_loss_broadcasts_weights, test/test_nn.py::TestNN::test_bce_loss_input_range, test/test_nn.py::TestNN::test_bce_loss_size_mismatch, test/test_nn.py::TestNN::test_bce_with_logits_broadcasts_pos_weights, test/test_nn.py::TestNN::test_bce_with_logits_broadcasts_weights, test/test_nn.py::TestNN::test_bce_with_logits_gives_same_result_as_sigmoid_and_bce_loss, test/test_nn.py::TestNN::test_bce_with_logits_gives_same_result_as_sigmoid_and_bce_loss_large_tensors_with_grad, test/test_nn.py::TestNN::test_bce_with_logits_has_correct_forward_grad, test/test_nn.py::TestNN::test_bce_with_logits_has_correct_grad_at_zero, test/test_nn.py::TestNN::test_bce_with_logits_ones_in_pos_weights_are_the_same_as_none, test/test_nn.py::TestNN::test_bce_with_logits_raises_if_target_and_input_are_different_size, test/test_nn.py::TestNN::test_bce_with_logits_stability, test/test_nn.py::TestNN::test_bce_with_logits_with_pos_weight_has_correct_grad_at_zero, test/test_nn.py::TestNN::test_bilinear, test/test_nn.py::TestNN::test_bilinear_broadcasting, test/test_nn.py::TestNN::test_bilinear_no_bias, test/test_nn.py::TestNN::test_bilinear_non_contiguous, test/test_nn.py::TestNN::test_broadcast_double_backwards_gpu, test/test_nn.py::TestNN::test_broadcast_no_grad, test/test_nn.py::TestNN::test_broadcast_not_requiring_grad, test/test_nn.py::TestNN::test_buffer_not_persistent, test/test_nn.py::TestNN::test_buffer_not_persistent_assign, test/test_nn.py::TestNN::test_buffer_not_persistent_del, test/test_nn.py::TestNN::test_buffer_not_persistent_load, test/test_nn.py::TestNN::test_buffer_not_persistent_overwrite, test/test_nn.py::TestNN::test_buffers_and_named_buffers, test/test_nn.py::TestNN::test_call_supports_python_dict_output, test/test_nn.py::TestNN::test_channel_shuffle_return_alias_of_self, test/test_nn.py::TestNN::test_children, test/test_nn.py::TestNN::test_container_copy, test/test_nn.py::TestNN::test_convert_sync_batchnorm, test/test_nn.py::TestNN::test_cosine_embedding_loss_error_on_diff_shapes, test/test_nn.py::TestNN::test_cosine_embedding_loss_error_on_nonexpandable_shapes, test/test_nn.py::TestNN::test_cosine_embedding_loss_invalid_shape, test/test_nn.py::TestNN::test_cosine_embedding_loss_margin_no_reduce, test/test_nn.py::TestNN::test_cosine_embedding_loss_no_reduce, test/test_nn.py::TestNN::test_cosine_embedding_loss_with_diff_type, test/test_nn.py::TestNN::test_cosine_similarity, test/test_nn.py::TestNN::test_cross_entropy_loss, test/test_nn.py::TestNN::test_cross_entropy_loss_precision, test/test_nn.py::TestNN::test_cross_entropy_loss_zero_div, test/test_nn.py::TestNN::test_cudnn_forward_exception, test/test_nn.py::TestNN::test_cudnn_rnn_dropout_states_device, test/test_nn.py::TestNN::test_cudnn_weight_format, test/test_nn.py::TestNN::test_cudnn_weight_tying, test/test_nn.py::TestNN::test_dir, test/test_nn.py::TestNN::test_dir_digit, test/test_nn.py::TestNN::test_elu_inplace_gradgrad, test/test_nn.py::TestNN::test_elu_inplace_on_view, test/test_nn.py::TestNN::test_error_RNN_seq_len_zero, test/test_nn.py::TestNN::test_extra_state, test/test_nn.py::TestNN::test_extra_state_missing_get_extra_state, test/test_nn.py::TestNN::test_extra_state_missing_set_extra_state, test/test_nn.py::TestNN::test_extra_state_non_dict, test/test_nn.py::TestNN::test_fb_fc_packed, test/test_nn.py::TestNN::test_flatten, test/test_nn.py::TestNN::test_fold_invalid_arg, test/test_nn.py::TestNN::test_fractional_max_pool2d_invalid_output_ratio, test/test_nn.py::TestNN::test_gaussian_nll_loss_args, test/test_nn.py::TestNN::test_gaussian_nll_loss_broadcasting, test/test_nn.py::TestNN::test_get_buffer, test/test_nn.py::TestNN::test_get_buffer_from_submodules, test/test_nn.py::TestNN::test_getattr_with_property, test/test_nn.py::TestNN::test_grid_sample, test/test_nn.py::TestNN::test_grid_sample_3d, test/test_nn.py::TestNN::test_grid_sample_error_checking, test/test_nn.py::TestNN::test_grid_sample_nearest_neighbor_rounding_mode_consistency, test/test_nn.py::TestNN::test_hardtanh_backward, test/test_nn.py::TestNN::test_hardtanh_inplace_gradgrad, test/test_nn.py::TestNN::test_huber_loss_invalid_delta, test/test_nn.py::TestNN::test_inplace_thnn, test/test_nn.py::TestNN::test_interpolate, test/test_nn.py::TestNN::test_interpolate_bicubic_2d, test/test_nn.py::TestNN::test_interpolate_bicubic_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_2d_zero_dim, test/test_nn.py::TestNN::test_interpolate_bicubic_2d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_2d, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_tuple_shared_2d, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_tuple_shared_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_tuple_skewed_2d, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_tuple_skewed_2d_align_corners, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_tuple_skewed_2d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_scale_tuple_skewed_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_tuple_2d, test/test_nn.py::TestNN::test_interpolate_bicubic_tuple_2d_align_corners, test/test_nn.py::TestNN::test_interpolate_bicubic_tuple_2d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_bicubic_tuple_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_2d, test/test_nn.py::TestNN::test_interpolate_bilinear_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_2d_zero_dim, test/test_nn.py::TestNN::test_interpolate_bilinear_2d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_2d, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_tuple_shared_2d, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_tuple_shared_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_tuple_skewed_2d, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_tuple_skewed_2d_align_corners, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_tuple_skewed_2d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_scale_tuple_skewed_2d_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_tuple_2d, test/test_nn.py::TestNN::test_interpolate_bilinear_tuple_2d_align_corners, test/test_nn.py::TestNN::test_interpolate_bilinear_tuple_2d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_bilinear_tuple_2d_cuda, test/test_nn.py::TestNN::test_interpolate_buffer_overflow, test/test_nn.py::TestNN::test_interpolate_illegal_memory_access, test/test_nn.py::TestNN::test_interpolate_linear_1d, test/test_nn.py::TestNN::test_interpolate_linear_1d_align_corners, test/test_nn.py::TestNN::test_interpolate_linear_1d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_linear_1d_cuda, test/test_nn.py::TestNN::test_interpolate_linear_1d_zero_dim, test/test_nn.py::TestNN::test_interpolate_linear_1d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_linear_scale_1d, test/test_nn.py::TestNN::test_interpolate_linear_scale_1d_align_corners, test/test_nn.py::TestNN::test_interpolate_linear_scale_1d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_linear_scale_1d_cuda, test/test_nn.py::TestNN::test_interpolate_linear_tuple_1d, test/test_nn.py::TestNN::test_interpolate_linear_tuple_1d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_1d, test/test_nn.py::TestNN::test_interpolate_nearest_1d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_1d_zero_dim, test/test_nn.py::TestNN::test_interpolate_nearest_1d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_2d, test/test_nn.py::TestNN::test_interpolate_nearest_2d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_2d_launch_configs, test/test_nn.py::TestNN::test_interpolate_nearest_2d_launch_configs_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_2d_zero_dim, test/test_nn.py::TestNN::test_interpolate_nearest_2d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_3d, test/test_nn.py::TestNN::test_interpolate_nearest_3d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_3d_zero_dim, test/test_nn.py::TestNN::test_interpolate_nearest_3d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_scale_1d, test/test_nn.py::TestNN::test_interpolate_nearest_scale_1d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_scale_2d, test/test_nn.py::TestNN::test_interpolate_nearest_scale_2d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_scale_3d, test/test_nn.py::TestNN::test_interpolate_nearest_scale_3d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_tuple_1d, test/test_nn.py::TestNN::test_interpolate_nearest_tuple_1d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_tuple_2d, test/test_nn.py::TestNN::test_interpolate_nearest_tuple_2d_cuda, test/test_nn.py::TestNN::test_interpolate_nearest_tuple_3d, test/test_nn.py::TestNN::test_interpolate_nearest_tuple_3d_cuda, test/test_nn.py::TestNN::test_interpolate_trilinear_3d, test/test_nn.py::TestNN::test_interpolate_trilinear_3d_cuda, test/test_nn.py::TestNN::test_interpolate_trilinear_3d_zero_dim, test/test_nn.py::TestNN::test_interpolate_trilinear_3d_zero_dim_cuda, test/test_nn.py::TestNN::test_interpolate_trilinear_scale_3d, test/test_nn.py::TestNN::test_interpolate_trilinear_scale_3d_align_corners, test/test_nn.py::TestNN::test_interpolate_trilinear_scale_3d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_trilinear_scale_3d_cuda, test/test_nn.py::TestNN::test_interpolate_trilinear_tuple_3d, test/test_nn.py::TestNN::test_interpolate_trilinear_tuple_3d_align_corners, test/test_nn.py::TestNN::test_interpolate_trilinear_tuple_3d_align_corners_cuda, test/test_nn.py::TestNN::test_interpolate_trilinear_tuple_3d_cuda, test/test_nn.py::TestNN::test_interpolate_undefined_behavior_casting, test/test_nn.py::TestNN::test_kl_div_log_softmax_target, test/test_nn.py::TestNN::test_kl_div_with_diff_type, test/test_nn.py::TestNN::test_kl_div_with_diff_type_log_target, test/test_nn.py::TestNN::test_l1_loss_correct, test/test_nn.py::TestNN::test_layer_norm_eps, test/test_nn.py::TestNN::test_layer_norm_grads_with_create_graph_flag, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_bias_weightCOO, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_bias_weightCSC, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_bias_weightCSR, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_bias_weightStrided, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_nobias_weightCOO, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_nobias_weightCSC, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_nobias_weightCSR, test/test_nn.py::TestNN::test_linear_autograd_device_cpu_nobias_weightStrided, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_bias_weightCOO, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_bias_weightCSC, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_bias_weightCSR, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_bias_weightStrided, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_nobias_weightCOO, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_nobias_weightCSC, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_nobias_weightCSR, test/test_nn.py::TestNN::test_linear_autograd_device_cuda_nobias_weightStrided, test/test_nn.py::TestNN::test_linear_broadcasting, test/test_nn.py::TestNN::test_load_state_dict, test/test_nn.py::TestNN::test_load_state_dict_BC, test/test_nn.py::TestNN::test_load_state_dict_assign_meta, test/test_nn.py::TestNN::test_load_state_dict_assign_shape_stride, test/test_nn.py::TestNN::test_load_state_dict_assign_with_optimizer, test/test_nn.py::TestNN::test_load_state_dict_child, test/test_nn.py::TestNN::test_load_state_dict_custom, test/test_nn.py::TestNN::test_load_state_dict_invalid, test/test_nn.py::TestNN::test_load_state_dict_ref_cycle, test/test_nn.py::TestNN::test_load_state_dict_type, test/test_nn.py::TestNN::test_load_state_dict_warn_assign, test/test_nn.py::TestNN::test_log_softmax_dim0, test/test_nn.py::TestNN::test_log_softmax_dim0_cuda, test/test_nn.py::TestNN::test_log_softmax_dim3, test/test_nn.py::TestNN::test_log_softmax_dim3_cuda, test/test_nn.py::TestNN::test_log_softmax_lastdim, test/test_nn.py::TestNN::test_log_softmax_lastdim_cuda, test/test_nn.py::TestNN::test_log_softmax_scalar, test/test_nn.py::TestNN::test_log_softmax_scalar_cuda, test/test_nn.py::TestNN::test_log_softmax_spatial, test/test_nn.py::TestNN::test_log_softmax_spatial_cuda, test/test_nn.py::TestNN::test_log_softmax_spatial_special, test/test_nn.py::TestNN::test_log_softmax_spatial_special_cuda, test/test_nn.py::TestNN::test_loss_equal_input_target_shape, test/test_nn.py::TestNN::test_margin_ranking_loss_margin_no_reduce, test/test_nn.py::TestNN::test_margin_ranking_loss_no_reduce, test/test_nn.py::TestNN::test_max_pool1d_invalid_output_size, test/test_nn.py::TestNN::test_module_apply_inplace_op, test/test_nn.py::TestNN::test_module_backcompat, test/test_nn.py::TestNN::test_module_super_init, test/test_nn.py::TestNN::test_module_to_argparse, test/test_nn.py::TestNN::test_modules, test/test_nn.py::TestNN::test_mse_loss_size_warning, test/test_nn.py::TestNN::test_multimarginloss_1d_input_0d_target_no_reduce, test/test_nn.py::TestNN::test_multimarginloss_1d_input_0d_target_no_reduce_cuda, test/test_nn.py::TestNN::test_named_children, test/test_nn.py::TestNN::test_named_modules, test/test_nn.py::TestNN::test_named_parameters_remove_duplicate, test/test_nn.py::TestNN::test_nested_tensor_from_mask, test/test_nn.py::TestNN::test_nested_tensor_from_mask_error, test/test_nn.py::TestNN::test_no_grad, test/test_nn.py::TestNN::test_non_leaf_parameters, test/test_nn.py::TestNN::test_normalize, test/test_nn.py::TestNN::test_overwrite_module_params_on_conversion, test/test_nn.py::TestNN::test_pack_sequence_batch_sizes_throw, test/test_nn.py::TestNN::test_pad_scalar_error, test/test_nn.py::TestNN::test_padding_list, test/test_nn.py::TestNN::test_pairwise_distance, test/test_nn.py::TestNN::test_parameter_assignment, test/test_nn.py::TestNN::test_parameterlistdict_pickle, test/test_nn.py::TestNN::test_parameterlistdict_setting_attributes, test/test_nn.py::TestNN::test_parameters_and_named_parameters, test/test_nn.py::TestNN::test_parameters_to_vector, test/test_nn.py::TestNN::test_parse_to, test/test_nn.py::TestNN::test_partial_flat_weights, test/test_nn.py::TestNN::test_pdist, test/test_nn.py::TestNN::test_pdist_cpu_gradgrad_unimplemented, test/test_nn.py::TestNN::test_pdist_cuda_gradgrad_unimplemented, test/test_nn.py::TestNN::test_pdist_empty_col, test/test_nn.py::TestNN::test_pdist_empty_row, test/test_nn.py::TestNN::test_pdist_large, test/test_nn.py::TestNN::test_pdist_zeros, test/test_nn.py::TestNN::test_pixel_shuffle_nhwc_cpu, test/test_nn.py::TestNN::test_pixel_shuffle_unshuffle, test/test_nn.py::TestNN::test_pointwise_loss_broadcast, test/test_nn.py::TestNN::test_pointwise_loss_target_grad_none_reduction, test/test_nn.py::TestNN::test_projections_errors_on_gru_and_rnn, test/test_nn.py::TestNN::test_projections_lstm_args_check, test/test_nn.py::TestNN::test_projections_lstm_check_device, test/test_nn.py::TestNN::test_projections_lstm_initial_hidden_state, test/test_nn.py::TestNN::test_register_buffer_allows_overwriting_with_same_name, test/test_nn.py::TestNN::test_register_buffer_raises_error_if_attr_exists, test/test_nn.py::TestNN::test_register_buffer_raises_error_if_name_is_not_string, test/test_nn.py::TestNN::test_register_buffer_raises_error_if_not_tensor, test/test_nn.py::TestNN::test_register_parameter_allows_overwriting_with_same_name, test/test_nn.py::TestNN::test_register_parameter_raises_error_if_attr_exists, test/test_nn.py::TestNN::test_register_parameter_raises_error_if_name_is_not_string, test/test_nn.py::TestNN::test_register_state_dict_pre_hook, test/test_nn.py::TestNN::test_register_state_dict_pre_hook_backward_compat, test/test_nn.py::TestNN::test_register_state_dict_pre_hook_lazy_module, test/test_nn.py::TestNN::test_relu_inplace_on_view, test/test_nn.py::TestNN::test_repr, test/test_nn.py::TestNN::test_requires_grad_, test/test_nn.py::TestNN::test_rnn_args_check, test/test_nn.py::TestNN::test_rnn_check_device, test/test_nn.py::TestNN::test_rnn_initial_hidden_state, test/test_nn.py::TestNN::test_rnn_weight_norm, test/test_nn.py::TestNN::test_share_memory, test/test_nn.py::TestNN::test_smoothl1loss_intergral_target, test/test_nn.py::TestNN::test_smoothl1loss_negative_beta_not_supported, test/test_nn.py::TestNN::test_softmax_functional_dim0, test/test_nn.py::TestNN::test_softmax_functional_dim0_cuda, test/test_nn.py::TestNN::test_softmax_functional_dim3, test/test_nn.py::TestNN::test_softmax_functional_dim3_cuda, test/test_nn.py::TestNN::test_softmax_functional_scalar, test/test_nn.py::TestNN::test_softmax_functional_scalar_cuda, test/test_nn.py::TestNN::test_softmax_lastdim, test/test_nn.py::TestNN::test_softmax_lastdim_cuda, test/test_nn.py::TestNN::test_softmax_lastdim_dtype, test/test_nn.py::TestNN::test_softmax_lastdim_dtype_cuda, test/test_nn.py::TestNN::test_softmax_spatial, test/test_nn.py::TestNN::test_softmax_spatial_cuda, test/test_nn.py::TestNN::test_softmax_spatial_dtype, test/test_nn.py::TestNN::test_softmax_spatial_dtype_cuda, test/test_nn.py::TestNN::test_softmax_spatial_special, test/test_nn.py::TestNN::test_softmax_spatial_special_cuda, test/test_nn.py::TestNN::test_softmin, test/test_nn.py::TestNN::test_spectral_norm, test/test_nn.py::TestNN::test_spectral_norm_dim, test/test_nn.py::TestNN::test_spectral_norm_forward, test/test_nn.py::TestNN::test_spectral_norm_load_state_dict, test/test_nn.py::TestNN::test_spectral_norm_pickle, test/test_nn.py::TestNN::test_state_dict, test/test_nn.py::TestNN::test_sync_batchnorm_accuracy_cuda, test/test_nn.py::TestNN::test_sync_batchnorm_backward_elemt, test/test_nn.py::TestNN::test_threshold_bfloat16_half, test/test_nn.py::TestNN::test_threshold_int, test/test_nn.py::TestNN::test_to, test/test_nn.py::TestNN::test_train_errors_for_invalid_mode, test/test_nn.py::TestNN::test_transformer_args_check, test/test_nn.py::TestNN::test_transformer_layer_args_check, test/test_nn.py::TestNN::test_transformerdecoder, test/test_nn.py::TestNN::test_transformerdecoderlayer, test/test_nn.py::TestNN::test_transformerdecoderlayer_gelu, test/test_nn.py::TestNN::test_triplet_margin_loss, test/test_nn.py::TestNN::test_triplet_margin_loss_no_reduce, test/test_nn.py::TestNN::test_triplet_margin_loss_swap, test/test_nn.py::TestNN::test_triplet_margin_loss_swap_no_reduce, test/test_nn.py::TestNN::test_type, test/test_nn.py::TestNN::test_unflatten, test/test_nn.py::TestNN::test_unflatten_invalid_arg, test/test_nn.py::TestNN::test_unfold_invalid_arg, test/test_nn.py::TestNN::test_upsamplingBilinear2d_spatial_invariance, test/test_nn.py::TestNN::test_upsamplingLinear1d, test/test_nn.py::TestNN::test_upsamplingLinear1d_spatial_invariance, test/test_nn.py::TestNN::test_upsamplingTrilinear3d_spatial_invariance, test/test_nn.py::TestNN::test_upsampling_bfloat16, test/test_nn.py::TestNN::test_upsampling_not_recompute_scale_factor, test/test_nn.py::TestNN::test_upsampling_small_scale, test/test_nn.py::TestNN::test_vector_to_parameters, test/test_nn.py::TestNN::test_weight_norm, test/test_nn.py::TestNN::test_weight_norm_pickle, test/test_nn.py::TestNN::test_zero_grad, test/test_nn.py::TestFusionEval::test_fuse_module_eval_numerics, test/test_nn.py::TestConstantPadNd::test_constant_pad_nd, test/test_nn.py::TestConstantPadNd::test_preserves_memory_format, test/test_nn.py::TestAddRelu::test_add_relu, test/test_nn.py::TestAddRelu::test_add_relu_broadcasting, test/test_nn.py::TestFunctionalPickle::test_pickle_softsign, test/test_nn.py::TestFusionUtils::test_fuse_conv_bn_requires_grad, test/test_nn.py::TestFusionUtils::test_fuse_linear_bn_requires_grad, test/test_nn.py::TestNNDeviceTypeCUDA::test_BatchNorm_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_Bilinear_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_cudnn_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_empty_target_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_no_batch_dim_reduction_mean_use_module_form_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_no_batch_dim_reduction_mean_use_module_form_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_no_batch_dim_reduction_none_use_module_form_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_no_batch_dim_reduction_none_use_module_form_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_no_batch_dim_reduction_sum_use_module_form_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_CTCLoss_no_batch_dim_reduction_sum_use_module_form_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_GRU_grad_and_gradgrad_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_GroupNorm_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_GroupNorm_general_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_GroupNorm_memory_format_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_GroupNorm_numeric_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_GroupNorm_raises_error_if_one_value_per_group_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_InstanceNorm1d_general_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_InstanceNorm2d_general_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_InstanceNorm3d_general_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_LSTM_differentiable_backward_using_oneDNN_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_LSTM_differentiable_backward_using_oneDNN_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_LSTM_grad_and_gradgrad_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_LayerNorm_general_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_LayerNorm_numeric_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_LocalResponseNorm_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_MarginLoss_empty_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_MarginLoss_empty_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_MarginLoss_warnings_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReflectionPad2d_large_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReflectionPad3d_large_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReflectionPad_empty_cuda_complex64, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReflectionPad_empty_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReplicationPad1d_large_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReplicationPad2d_large_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReplicationPad3d_large_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReplicationPad_empty_cuda_complex128, test/test_nn.py::TestNNDeviceTypeCUDA::test_ReplicationPad_empty_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_TransformerDecoderLayer_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_TransformerDecoder_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_TransformerEncoderLayer_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_TransformerEncoder_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_Transformer_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_Unfold_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_activations_bfloat16_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_activations_bfloat16_half_cpu_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_activations_bfloat16_half_cpu_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_adaptiveavg_pool1d_shmem_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_affine_2d_rotate0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_affine_2d_rotate45_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_affine_2d_rotate90_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_affine_2d_rotateRandom_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_affine_3d_rotateRandom_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_avg_pool_large_tensor_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_affine_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_affine_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_affine_mixed_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_affine_mixed_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_eval_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_eval_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_eval_mixed_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_eval_mixed_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_grad_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_large_batch_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_large_batch_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_simple_average_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_simple_average_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_simple_average_mixed_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_simple_average_mixed_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_batchnorm_update_stats_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_channel_shuffle_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_error_if_nonfinite_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_False_norm_type_0_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_False_norm_type_1_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_False_norm_type_2_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_False_norm_type_4_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_False_norm_type_inf_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_True_norm_type_0_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_True_norm_type_1_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_True_norm_type_2_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_True_norm_type_4_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_foreach_True_norm_type_inf_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_multi_device_foreach_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_norm_multi_device_foreach_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_value_foreach_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_clip_grad_value_foreach_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_empty_input_cuda_complex128, test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_empty_input_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_empty_input_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_empty_input_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_64bit_reduction_mean_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_64bit_reduction_none_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_64bit_reduction_sum_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_label_smoothing_consistent_index_target_and_probs_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_label_smoothing_errors_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_label_smoothing_weight_ignore_indices_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_label_smoothing_with_probs_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_mean_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_none_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_large_tensor_reduction_sum_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_index_target_unit_weights_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_one_hot_target_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_all_reductions_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_no_batch_dim_reduction_mean_weighted_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_no_batch_dim_reduction_mean_weighted_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_no_batch_dim_reduction_none_weighted_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_no_batch_dim_reduction_none_weighted_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_no_batch_dim_reduction_sum_weighted_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_no_batch_dim_reduction_sum_weighted_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_prob_target_unit_weights_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ctc_loss_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_ctc_loss_cudnn_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_device_mask_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_elu_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_elu_inplace_with_neg_alpha_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_fold_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_glu_bfloat16_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_bfloat16_precision_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_half_precision_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_large_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_large_index_2d_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_large_index_2d_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_large_index_3d_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_large_index_3d_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_nan_inf_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_grid_sample_nan_inf_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_groupnorm_nhwc_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_groupnorm_nhwc_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_groupnorm_nhwc_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_groupnorm_nhwc_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_gumbel_softmax_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_gumbel_softmax_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_gumbel_softmax_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_hardsigmoid_grad_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_hardswish_grad_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_hardswish_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_for_single_spatial_element_during_training_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm1d_no_batch_dim_False_affine_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm1d_no_batch_dim_False_affine_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm1d_no_batch_dim_True_affine_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm1d_no_batch_dim_True_affine_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm2d_no_batch_dim_False_affine_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm2d_no_batch_dim_False_affine_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm2d_no_batch_dim_True_affine_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm2d_no_batch_dim_True_affine_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm3d_no_batch_dim_False_affine_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm3d_no_batch_dim_False_affine_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm3d_no_batch_dim_True_affine_False_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_input_channels_is_not_num_features_InstanceNorm3d_no_batch_dim_True_affine_True_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_instancenorm_raises_error_if_less_than_one_value_per_channel_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_invalid_reduction_strings_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_layernorm_half_precision_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_layernorm_weight_bias_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_leaky_relu_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_leaky_relu_inplace_with_neg_slope_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_leaky_relu_inplace_with_zero_slope_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_linear_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_log_softmax_big_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_log_softmax_big_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_log_softmax_cpu_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_log_softmax_cpu_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_logsigmoid_out_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_lstmcell_backward_only_one_output_grad_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_TxT_layout_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_devices_parity_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_forward_with_nans_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_grad_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_lowp_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_lowp_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_mask_types_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_masked_softmax_transformer_layout_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_mish_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_module_to_empty_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_module_to_empty_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_module_to_empty_non_recursive_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_all_ignored_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_byte_target_matches_long_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_empty_tensor_reduction_mean_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_empty_tensor_reduction_none_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_empty_tensor_reduction_sum_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_invalid_target_dim_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_invalid_weights_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_large_tensor_reduction_mean_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_large_tensor_reduction_none_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_large_tensor_reduction_sum_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_mismatched_batch_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_out_of_bounds_ignore_index_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nll_loss_total_weight_is_zero_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nn_empty_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nn_scalars_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nn_scalars_reductions_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_nonlinearity_propagate_nan_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_one_hot_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_overwrite_module_params_on_conversion_cpu_device_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_pad_cuda_complex128, test/test_nn.py::TestNNDeviceTypeCUDA::test_pad_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_prelu_backward_32bit_indexing_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_rnn_fused_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_rnn_fused_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_rnn_retain_variables_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_rnn_retain_variables_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_rnn_retain_variables_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_save_lstm_compatibility_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_silu_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_skip_init_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_smooth_l1_loss_bfloat16_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_smooth_l1_loss_vs_huber_loss_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_smoothl1loss_backward_zero_beta_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_64bit_indexing_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_backward_64bit_indexing_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_bfloat16_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_cpu_cuda_bfloat16, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_cpu_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_forward_64bit_indexing_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_results_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_softmax_results_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_softplus_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softplus_low_threshold_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softshrink_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softshrink_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_softshrink_negative_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_threshold_inplace_overlap_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_to_complex_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_transformerencoderlayer_fast_path_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_transformerencoderlayer_gelu_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_transformerencoderlayer_gelu_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_triplet_margin_with_distance_loss_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_triplet_margin_with_distance_loss_default_parity_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format0_align_corners_False_input_size_399_output_size_437_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format0_align_corners_False_input_size_403_output_size_377_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format0_align_corners_True_input_size_399_output_size_437_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format0_align_corners_True_input_size_403_output_size_377_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format1_align_corners_False_input_size_399_output_size_437_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format1_align_corners_False_input_size_403_output_size_377_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format1_align_corners_True_input_size_399_output_size_437_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiLinear2d_consistency_interp_size_bug_memory_format1_align_corners_True_input_size_403_output_size_377_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_False_mode_bicubic_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_False_mode_bicubic_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_False_mode_bilinear_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_False_mode_bilinear_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_True_mode_bicubic_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_True_mode_bicubic_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_True_mode_bilinear_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_False_align_corners_True_mode_bilinear_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_False_mode_bicubic_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_False_mode_bicubic_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_False_mode_bilinear_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_False_mode_bilinear_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_True_mode_bicubic_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_True_mode_bicubic_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_True_mode_bilinear_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_antialias_True_align_corners_True_mode_bilinear_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format0_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bicubic_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_False_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_False_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_3_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_32_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_False_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_False_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_restrided_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_consistency_memory_format1_mode_bilinear_antialias_True_align_corners_True_num_channels_5_output_size_600_check_as_unsqueezed_3d_tensor_True_non_contig_sliced_batch_size_5_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bicubic_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_bilinear_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest-exact_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_3_mode_nearest_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bicubic_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_bilinear_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest-exact_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_False_num_channels_5_mode_nearest_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bicubic_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_bilinear_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest-exact_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_3_mode_nearest_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bicubic_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_bilinear_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest-exact_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_float32_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_float64_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_int16_cuda_int16, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_int32_cuda_int32, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_int64_cuda_int64, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_int8_cuda_int8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBiMode2d_nonsupported_dtypes_antialias_True_num_channels_5_mode_nearest_uint8_cuda_uint8, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBicubic2d_aa_correctness_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBicubic2d_aa_correctness_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBicubic2d_correctness_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBilinear2d_aa_correctness_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingBilinear2d_aa_correctness_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest1d_correctness_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest1d_correctness_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest1d_launch_config_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest1d_mode_nearest-exact_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest1d_mode_nearest_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_correctness_memory_format0_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_correctness_memory_format0_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_correctness_memory_format1_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_correctness_memory_format1_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_launch_config_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_launch_fail_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_launch_rocm_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_memory_format0_mode_nearest-exact_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_memory_format0_mode_nearest_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_memory_format1_mode_nearest-exact_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest2d_memory_format1_mode_nearest_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_correctness_memory_format0_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_correctness_memory_format0_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_correctness_memory_format1_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_correctness_memory_format1_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_launch_config_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_memory_format0_mode_nearest-exact_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_memory_format0_mode_nearest_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_memory_format1_mode_nearest-exact_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearest3d_memory_format1_mode_nearest_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact1d_correctness_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact1d_correctness_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact1d_rescale_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact2d_correctness_memory_format0_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact2d_correctness_memory_format0_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact2d_correctness_memory_format1_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact2d_correctness_memory_format1_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact3d_correctness_memory_format0_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact3d_correctness_memory_format0_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact3d_correctness_memory_format1_isize_10_osize_15_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingNearestExact3d_correctness_memory_format1_isize_20_osize_11_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingTrilinear3d_align_corners_False_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingTrilinear3d_align_corners_False_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingTrilinear3d_align_corners_True_memory_format0_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsamplingTrilinear3d_align_corners_True_memory_format1_cuda, test/test_nn.py::TestNNDeviceTypeCUDA::test_upsampling_64bit_indexing_channels_last_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_variable_sequence_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_variable_sequence_cuda_float32, test/test_nn.py::TestNNDeviceTypeCUDA::test_variable_sequence_cuda_float64, test/test_nn.py::TestNNDeviceTypeCUDA::test_warp_softmax_64bit_indexing_cuda_float16, test/test_nn.py::TestNNDeviceTypeCUDA::test_warp_softmax_64bit_indexing_cuda_float32 2024-02-01T00:15:51.9749845Z 2024-02-01T00:15:51.9750217Z Running test_public_bindings 1/1 ... [2024-02-01 00:15:51.702030] 2024-02-01T00:15:51.9751852Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_public_bindings.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:15:51.702250] 2024-02-01T00:15:54.0206256Z 2024-02-01T00:15:54.0208037Z test_public_bindings 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_public_bindings_1.1_a7609f7497d6e925_.log 2024-02-01T00:15:54.0210604Z Running 3 items in this shard: test/test_public_bindings.py::TestPublicBindings::test_correct_module_names, test/test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported, test/test_public_bindings.py::TestPublicBindings::test_no_new_bindings 2024-02-01T00:15:54.0211977Z 2024-02-01T00:15:54.0212317Z Running test_multiprocessing 1/1 ... [2024-02-01 00:15:54.020579] 2024-02-01T00:15:54.0214106Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_multiprocessing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:15:54.020825] 2024-02-01T00:17:10.6694127Z 2024-02-01T00:17:10.6695328Z test_multiprocessing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_multiprocessing_1.1_13141a73dff7db27_.log 2024-02-01T00:17:10.6711542Z Running 37 items in this shard: test/test_multiprocessing.py::TestMultiprocessing::test_autograd_errors, test/test_multiprocessing.py::TestMultiprocessing::test_autograd_fine_with_spawn, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_bad_call, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_ipc_deadlock, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_memory_allocation, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_parameter_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_send_many, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_simple, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_small_tensors, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_empty_shared, test/test_multiprocessing.py::TestMultiprocessing::test_empty_tensor_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_empty_tensor_sharing_cuda, test/test_multiprocessing.py::TestMultiprocessing::test_event, test/test_multiprocessing.py::TestMultiprocessing::test_event_handle_exporter, test/test_multiprocessing.py::TestMultiprocessing::test_event_handle_importer, test/test_multiprocessing.py::TestMultiprocessing::test_event_handle_multi_gpu, test/test_multiprocessing.py::TestMultiprocessing::test_event_multiprocess, test/test_multiprocessing.py::TestMultiprocessing::test_fd_pool, test/test_multiprocessing.py::TestMultiprocessing::test_fd_preserve_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_fd_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_fs, test/test_multiprocessing.py::TestMultiprocessing::test_fs_is_shared, test/test_multiprocessing.py::TestMultiprocessing::test_fs_pool, test/test_multiprocessing.py::TestMultiprocessing::test_fs_preserve_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_fs_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_inherit_tensor, test/test_multiprocessing.py::TestMultiprocessing::test_integer_parameter_serialization_cpu, test/test_multiprocessing.py::TestMultiprocessing::test_integer_parameter_serialization_cuda, test/test_multiprocessing.py::TestMultiprocessing::test_is_shared, test/test_multiprocessing.py::TestMultiprocessing::test_is_shared_cuda, test/test_multiprocessing.py::TestMultiprocessing::test_leaf_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_mixed_types_cuda_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_non_leaf_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_parameter_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_wrong_cuda_fork 2024-02-01T00:17:10.6724586Z 2024-02-01T00:17:10.6725273Z Running test_modules 1/3 ... [2024-02-01 00:17:10.669507] 2024-02-01T00:17:10.6726884Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_modules.py', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:17:10.669730] 2024-02-01T00:32:12.2211873Z 2024-02-01T00:32:12.2217310Z test_modules 1/3 was successful, full logs can be found in artifacts with path test/test-reports/test_modules_1.3_c3997692a50489d0_.log 2024-02-01T00:32:12.2700827Z Running 951 items in this shard: test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_ELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_Hardtanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_LeakyReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_ReLU6_cuda_float64, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_SELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_check_inplace_nn_Threshold_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_AdaptiveAvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_AdaptiveMaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_AdaptiveMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_BatchNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CircularPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConstantPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConstantPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConvTranspose1d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConvTranspose2d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConvTranspose3d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConvTranspose3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CosineEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_FractionalMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_FractionalMaxPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_GELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_GELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_GRU_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_GroupNorm_cuda_bfloat16, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_GroupNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Hardshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Hardswish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Hardtanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_InstanceNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_KLDivLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_L1Loss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_L1Loss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LPPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LazyConv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LazyConv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LazyConv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LazyConvTranspose2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Linear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LocalResponseNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LocalResponseNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LogSigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LogSoftmax_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Mish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_MultiLabelMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_MultiMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_NLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_PoissonNLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_RNNCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_RNNCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReLU6_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReLU6_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReflectionPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReplicationPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReplicationPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_SmoothL1Loss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Softplus_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Softshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Softsign_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Tanhshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Threshold_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerEncoderLayer_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerEncoder_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ZeroPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_ZeroPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AdaptiveAvgPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AdaptiveAvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AdaptiveAvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AdaptiveAvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AdaptiveMaxPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AdaptiveMaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_AvgPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_BCELoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_BCELoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_BatchNorm1d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_BatchNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Bilinear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConstantPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Conv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Conv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Conv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConvTranspose1d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConvTranspose1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConvTranspose2d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConvTranspose2d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConvTranspose2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_FractionalMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_GELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_GRU_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_GaussianNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Hardshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Hardswish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Hardswish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Hardtanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_InstanceNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_InstanceNorm1d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_InstanceNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_InstanceNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_L1Loss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_L1Loss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LPPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LPPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LSTMCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LazyConv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LazyConv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LazyConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LeakyReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LocalResponseNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LocalResponseNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LogSigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_LogSoftmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_MSELoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_MultiheadAttention_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_NLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_PoissonNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_RNN_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_RNN_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ReLU6_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ReflectionPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ReplicationPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_SELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_SoftMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_SoftMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Softmax2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Softmin_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Softplus_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Tanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Tanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Threshold_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_Threshold_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_TransformerDecoderLayer_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_TransformerDecoderLayer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_TransformerEncoderLayer_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_TransformerEncoder_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ZeroPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ZeroPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_device_ctx_init_nn_ZeroPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_errors_nn_CircularPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_errors_nn_GRU_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_errors_nn_GRU_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_errors_nn_LSTM_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_errors_nn_LSTM_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_errors_nn_RNNCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_AdaptiveAvgPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_AdaptiveAvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_AdaptiveMaxPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_AdaptiveMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_AvgPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_BCELoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_BCEWithLogitsLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_BatchNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_BatchNorm1d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_BatchNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Bilinear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Bilinear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_CELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_CircularPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ConstantPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Conv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Conv2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ConvTranspose1d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ConvTranspose2d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ConvTranspose2d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ConvTranspose2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ConvTranspose3d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_FractionalMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_GRU_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_GaussianNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_GroupNorm_cuda_float16, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_GroupNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Hardshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Hardswish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Hardtanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_HuberLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_InstanceNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_InstanceNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_InstanceNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_InstanceNorm3d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_InstanceNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LPPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LPPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LSTMCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LSTM_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LayerNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LazyConv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LazyConv2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LazyConv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Linear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Linear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LocalResponseNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LocalResponseNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LogSigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LogSigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_LogSoftmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MarginRankingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiLabelMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiLabelSoftMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_NLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_PReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_PoissonNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_RNN_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_RNN_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ReflectionPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ReplicationPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Sigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Softmax2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Softmax_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Softmin_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Softmin_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_Softsign_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_TransformerEncoderLayer_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_TransformerEncoder_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ZeroPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_ZeroPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_AdaptiveAvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_AdaptiveAvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_AdaptiveAvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_AdaptiveAvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_AdaptiveMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_BCELoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_BatchNorm1d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_BatchNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_BatchNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_Bilinear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_CircularPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_CircularPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ConstantPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ConstantPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Conv2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_ConvTranspose1d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ConvTranspose1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ConvTranspose3d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_forward_nn_ConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_CosineEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_Embedding_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_FractionalMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_GELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_GRUCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_GRU_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_GRU_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_GroupNorm_cuda_bfloat16, test/test_modules.py::TestModuleCUDA::test_forward_nn_Hardshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_InstanceNorm1d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_InstanceNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_InstanceNorm3d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_InstanceNorm3d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_KLDivLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_LSTM_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_LazyConv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_LeakyReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_Linear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Linear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_LogSigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_LogSigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_LogSoftmax_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_MarginRankingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Mish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_MultiLabelSoftMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_MultiMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_PReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_PReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_RNNCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_RNN_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReLU6_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReflectionPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReplicationPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReplicationPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReplicationPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_SELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_SiLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Sigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_SoftMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_Softmax_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Softmin_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Softplus_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Softshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Tanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_Tanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_Tanhshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_TransformerEncoderLayer_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_TransformerEncoder_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_TransformerEncoder_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_forward_nn_TransformerEncoder_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_Transformer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_forward_nn_ZeroPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_grad_nn_AdaptiveAvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_AdaptiveMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_AdaptiveMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_BatchNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_CTCLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_ConstantPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_ConstantPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_ConstantPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_ConvTranspose2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_CrossEntropyLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_GLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_GRUCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_GRU_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_InstanceNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_InstanceNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_InstanceNorm3d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_LazyConv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_LazyConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_LogSigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_LogSoftmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_MarginRankingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_MultiLabelSoftMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_NLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_RNN_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_RNN_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_ReflectionPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_Softshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_Tanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_Tanhshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_TransformerEncoderLayer_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_grad_nn_Transformer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_AvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_AvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Bilinear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_CircularPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_CircularPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_ConstantPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_ConstantPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Conv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Conv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_ConvTranspose1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_ELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_FractionalMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_FractionalMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_GRU_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_GRU_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_GroupNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Hardtanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_InstanceNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_L1Loss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_LPPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_LPPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_LayerNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_LazyConv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_LazyConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_MultiLabelMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_MultiheadAttention_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_NLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_RNNCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_RNN_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_RNN_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_SmoothL1Loss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Softsign_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Threshold_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_TransformerDecoderLayer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_TransformerEncoderLayer_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_TransformerEncoderLayer_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_TransformerEncoder_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_ZeroPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveAvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveAvgPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveAvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveMaxPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveMaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AdaptiveMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AvgPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_AvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_BatchNorm2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_BatchNorm3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Bilinear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Bilinear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CTCLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CircularPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CircularPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CircularPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConstantPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConstantPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConstantPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConstantPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Conv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Conv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose1d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose1d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose1d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose2d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose2d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose2d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CosineEmbeddingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_CrossEntropyLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Embedding_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_GELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_GaussianNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_GroupNorm_cuda_float16, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Hardshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Hardshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_InstanceNorm1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_InstanceNorm3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_InstanceNorm3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LPPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LPPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LPPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LSTMCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LSTM_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LayerNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConv2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LazyConvTranspose2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Linear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LocalResponseNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_LogSigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_MSELoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_MultiLabelMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_MultiheadAttention_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_MultiheadAttention_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_RNN_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_RNN_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ReflectionPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ReplicationPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ReplicationPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ReplicationPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_SiLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_SiLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_SmoothL1Loss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Softplus_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Softsign_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Tanhshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Tanhshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Threshold_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Threshold_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_TransformerDecoderLayer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_Transformer_cuda_float32, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ZeroPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_if_train_and_eval_modes_differ_nn_ZeroPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_AdaptiveAvgPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_AdaptiveAvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_AdaptiveMaxPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_AdaptiveMaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_AvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BCELoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BatchNorm1d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BatchNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BatchNorm3d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_BatchNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Bilinear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CTCLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CTCLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Conv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Conv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ConvTranspose1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ConvTranspose2d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ConvTranspose2d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ConvTranspose2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ConvTranspose3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CosineEmbeddingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CrossEntropyLoss_cuda_float16, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CrossEntropyLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_CrossEntropyLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Embedding_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_FractionalMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_FractionalMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GRUCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GRU_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GRU_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GaussianNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GaussianNLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_GroupNorm_cuda_bfloat16, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Hardshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Hardswish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Hardswish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_InstanceNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_InstanceNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_InstanceNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_KLDivLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LPPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LPPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LPPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LPPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LSTMCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LSTM_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LSTM_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LayerNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LazyConv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LazyConvTranspose1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LazyConvTranspose3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_LazyConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Linear_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Linear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MarginRankingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MaxPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MaxPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MultiLabelMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MultiMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_NLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_PReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_PoissonNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_PoissonNLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_RNNCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_RNN_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_RNN_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_RNN_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ReflectionPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ReflectionPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ReplicationPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ReplicationPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_SELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_SiLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Sigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Sigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Softmax2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Softmax_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Softmin_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Softsign_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_Tanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_TransformerDecoderLayer_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_TransformerEncoderLayer_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_TransformerEncoder_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ZeroPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_memory_format_nn_ZeroPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AdaptiveAvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AdaptiveAvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AdaptiveAvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AdaptiveMaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AdaptiveMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AdaptiveMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AvgPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_AvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BatchNorm1d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BatchNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BatchNorm1d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BatchNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BatchNorm3d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_BatchNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CTCLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CircularPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CircularPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CircularPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CircularPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConstantPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConstantPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConstantPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Conv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Conv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Conv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConvTranspose1d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConvTranspose1d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConvTranspose1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConvTranspose3d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ConvTranspose3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CosineEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_CrossEntropyLoss_cuda_float16, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Embedding_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Embedding_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_FractionalMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_FractionalMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_GRUCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_GaussianNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Hardswish_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_HingeEmbeddingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_InstanceNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_InstanceNorm1d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_KLDivLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_L1Loss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_L1Loss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LPPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LPPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LSTMCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LSTMCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LSTM_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LSTM_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LayerNorm_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LazyConv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LazyConv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LazyConvTranspose1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LazyConvTranspose1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LazyConvTranspose2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LazyConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LeakyReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LocalResponseNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LogSigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LogSoftmax_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_LogSoftmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_MSELoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_MarginRankingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_MaxPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_MultiheadAttention_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_MultiheadAttention_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_NLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_PReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_PReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_PoissonNLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_RNNCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_RNN_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ReLU6_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ReLU6_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ReflectionPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ReflectionPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_SELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_SiLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Softplus_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Softshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Tanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Threshold_cuda_float64, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Transformer_cuda_float32, test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_ZeroPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_AdaptiveAvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_AdaptiveMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_AdaptiveMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_AvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BCEWithLogitsLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BCEWithLogitsLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BatchNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BatchNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BatchNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BatchNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_BatchNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CTCLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CircularPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CircularPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ConstantPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Conv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ConvTranspose1d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ConvTranspose2d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ConvTranspose2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ConvTranspose3d_cuda_complex32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CosineEmbeddingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CosineEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CrossEntropyLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_CrossEntropyLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Embedding_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Embedding_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GRUCell_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GaussianNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GaussianNLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GroupNorm_cuda_bfloat16, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_GroupNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Hardshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Hardshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_HuberLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_HuberLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_InstanceNorm1d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_InstanceNorm1d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_InstanceNorm3d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_InstanceNorm3d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_KLDivLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LSTM_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LazyConv2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LazyConv2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LazyConv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LazyConvTranspose1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LeakyReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Linear_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_LogSigmoid_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MSELoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MaxPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_NLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_RNNCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_RNN_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_RNN_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ReLU6_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ReflectionPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ReflectionPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_SiLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Sigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_SoftMarginLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Softmax2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Softmin_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Softmin_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Softplus_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Softshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Tanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Tanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Threshold_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Threshold_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerDecoderLayer_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerEncoderLayer_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerEncoder_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Transformer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ZeroPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ZeroPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_ZeroPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AdaptiveAvgPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AdaptiveAvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AdaptiveMaxPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AdaptiveMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AdaptiveMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AvgPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_AvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_BCELoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_BatchNorm1d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_BatchNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_BatchNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_BatchNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CTCLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CircularPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ConstantPad1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ConstantPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ConstantPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Conv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Conv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ConvTranspose3d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CosineEmbeddingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CosineEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CrossEntropyLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_CrossEntropyLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Embedding_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_FractionalMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_FractionalMaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_GLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_GLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_GRU_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_GRU_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_GRU_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_GaussianNLLLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_HuberLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_InstanceNorm1d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_InstanceNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_InstanceNorm3d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_InstanceNorm3d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LPPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LPPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LPPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LSTMCell_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LSTM_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LSTM_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LazyConv1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LazyConv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LocalResponseNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_LogSigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MarginRankingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MaxPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MaxPool1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MaxPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MaxPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Mish_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MultiheadAttention_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_PReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_PReLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_RNN_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_RNN_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ReLU6_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ReLU6_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ReflectionPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ReflectionPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_SELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_SELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_SiLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_SiLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Sigmoid_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Softmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Softmin_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Softplus_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Softshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_TransformerEncoderLayer_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_TransformerEncoder_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_TransformerEncoder_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_Transformer_cuda_float32, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ZeroPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_pickle_nn_ZeroPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_AdaptiveAvgPool1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_AvgPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_AvgPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_AvgPool3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_BatchNorm1d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_BatchNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_BatchNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_BatchNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_CircularPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_CircularPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConstantPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConstantPad2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConstantPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConstantPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_Conv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_Conv3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_Conv3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConvTranspose1d_cuda_complex128, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConvTranspose2d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConvTranspose3d_cuda_complex64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConvTranspose3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ConvTranspose3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_CosineEmbeddingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_CrossEntropyLoss_cuda_float16, test/test_modules.py::TestModuleCUDA::test_repr_nn_CrossEntropyLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_Embedding_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_Embedding_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_FractionalMaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_FractionalMaxPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_GELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_GLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_GRU_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_GroupNorm_cuda_bfloat16, test/test_modules.py::TestModuleCUDA::test_repr_nn_Hardshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_Hardtanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_Hardtanh_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_HingeEmbeddingLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_HuberLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm1d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm1d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm2d_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm2d_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm2d_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm2d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_InstanceNorm3d_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_KLDivLoss_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_LPPool2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_LPPool3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_LSTM_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_LSTM_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_LayerNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_LazyConv1d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_LocalResponseNorm_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_LogSoftmax_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_MarginRankingLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_MaxPool2d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiLabelMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiMarginLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiheadAttention_train_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_PoissonNLLLoss_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_RNN_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_RNN_eval_mode_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_RNN_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReLU6_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReLU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReflectionPad1d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReflectionPad2d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReflectionPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReflectionPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReplicationPad3d_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ReplicationPad3d_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_SELU_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_SELU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_SiLU_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_Tanh_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_Tanhshrink_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_Tanhshrink_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_TransformerDecoderLayer_cuda_float64, test/test_modules.py::TestModuleCUDA::test_repr_nn_TransformerEncoderLayer_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_TransformerEncoder_eval_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_TransformerEncoder_train_mode_cuda_float32, test/test_modules.py::TestModuleCUDA::test_repr_nn_ZeroPad3d_cuda_float64 2024-02-01T00:32:12.3075312Z 2024-02-01T00:32:12.3075763Z Running test_sparse_csr 2/3 ... [2024-02-01 00:32:12.222741] 2024-02-01T00:32:12.3077463Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sparse_csr.py', '--shard-id=2', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:32:12.222941] 2024-02-01T00:39:31.9073497Z 2024-02-01T00:39:31.9076912Z test_sparse_csr 2/3 was successful, full logs can be found in artifacts with path test/test-reports/test_sparse_csr_2.3_01c9fb68a17c872f_.log 2024-02-01T00:39:31.9850241Z Running 1636 items in this shard: test/test_sparse_csr.py::TestSparseCSRCUDA::test_add_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_errors_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_0_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_25_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_0_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_11x9_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_11x9_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_11x9_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_3x3_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_5x7_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_5x7_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_5x7_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_dense_output_addmm_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_erf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_frac_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_isnan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_round_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sgn_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sinh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sqrt_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_tan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_baddbmm_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_baddbmm_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_False_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_False_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_False_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_True_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_False_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_True_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int64_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int64_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_True_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_True_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int64_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSC_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSR_SparseBSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSR_SparseBSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSR_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSR_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSC_SparseBSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSC_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSR_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_matvec_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_storage_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_to_block_csr_blocksize_2_cuda_float64_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_to_block_csr_blocksize_4_cuda_float64_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_to_block_csr_errors_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSR_Batched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSR_NonBatched_Hybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseCSC_Batched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseCSR_Batched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseCSR_NonBatched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_direct_coo_csr_conversion_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_direct_coo_csr_conversion_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_direct_coo_csr_conversion_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_matmul_device_mismatch_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mm_errors_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_autograd_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_autograd_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_autograd_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_errors_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_errors_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_errors_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_nn_functional_relu_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_nn_functional_relu_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_nn_functional_relu_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sign_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sign_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_trunc_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erf_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_nn_functional_relu_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_mm_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_mm_reduce_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_to_sparse_compressed_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_triangular_solve_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isinf_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isinf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isposinf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_nn_functional_relu_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_nn_functional_relu_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_nn_functional_relu_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_round_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_round_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sign_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sign_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_abs_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_abs_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_ceil_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_ceil_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_ceil_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_deg2rad_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isnan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amax_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_mean_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_mean_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_mean_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_randn_like_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_randn_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_round_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sign_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sign_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tanh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_trunc_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erfinv_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erfinv_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erfinv_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_randn_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_randn_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_round_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_round_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asinh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asinh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_nn_functional_relu_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_round_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_trunc_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_trunc_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_dim_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSC_target_sparse_compressed_tensor_no_size_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSC_target_validate_sparse_compressed_tensor_args_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSR_target_sparse_compressed_tensor_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSR_target_validate_sparse_compressed_tensor_args_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseCSC_target_sparse_compressed_tensor_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseCSC_target_sparse_compressed_tensor_no_size_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseCSC_target_validate_sparse_compressed_tensor_args_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseCSR_target_sparse_compressed_tensor_no_size_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_16_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_16_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_16_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_32_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_64_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_64_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_16_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_16x32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_16x32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_2x3_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_softmax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_softmax_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_addmm_blocksize_16_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_linear_blocksize_16_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_linear_blocksize_16x32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_linear_blocksize_16x32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_linear_blocksize_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_16_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_16x32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_16_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_16_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_16_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_16_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scatter_mm_cuda_float32 2024-02-01T00:39:32.0597589Z 2024-02-01T00:39:32.0598082Z Running inductor/test_cuda_repro 1/1 ... [2024-02-01 00:39:31.909918] 2024-02-01T00:39:32.0599834Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_repro.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:39:31.910129] 2024-02-01T00:40:39.9962039Z 2024-02-01T00:40:39.9963581Z inductor/test_cuda_repro 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_repro_1.1_3b8cfbddec01ee0b_.log 2024-02-01T00:40:39.9982772Z Running 45 items in this shard: test/inductor/test_cuda_repro.py::CudaReproTests::test_accuracy_issue1, test/inductor/test_cuda_repro.py::CudaReproTests::test_autotune_inplace_kernel, test/inductor/test_cuda_repro.py::CudaReproTests::test_backward_context, test/inductor/test_cuda_repro.py::CudaReproTests::test_bucketize_dynamic_dense, test/inductor/test_cuda_repro.py::CudaReproTests::test_deterministic_algorithms, test/inductor/test_cuda_repro.py::CudaReproTests::test_dtype_factory_issue, test/inductor/test_cuda_repro.py::CudaReproTests::test_dynamic_shapes, test/inductor/test_cuda_repro.py::CudaReproTests::test_dynamic_to_static_cudagraphs, test/inductor/test_cuda_repro.py::CudaReproTests::test_embedding_var_mean, test/inductor/test_cuda_repro.py::CudaReproTests::test_expanded_inputs_cudagraphs, test/inductor/test_cuda_repro.py::CudaReproTests::test_expanded_inputs_cudagraphs_no_size_asserts, test/inductor/test_cuda_repro.py::CudaReproTests::test_flash_attention_dynamic, test/inductor/test_cuda_repro.py::CudaReproTests::test_float64_constants, test/inductor/test_cuda_repro.py::CudaReproTests::test_full_copy, test/inductor/test_cuda_repro.py::CudaReproTests::test_index_put_cudagraph, test/inductor/test_cuda_repro.py::CudaReproTests::test_index_put_inplace_cudagraph, test/inductor/test_cuda_repro.py::CudaReproTests::test_index_put_issue, test/inductor/test_cuda_repro.py::CudaReproTests::test_index_put_no_fallback_cudagraph, test/inductor/test_cuda_repro.py::CudaReproTests::test_indirect_indexing_dense_mask, test/inductor/test_cuda_repro.py::CudaReproTests::test_inductor_output_aliases_intermediate, test/inductor/test_cuda_repro.py::CudaReproTests::test_inplace_add_alpha_autotune, test/inductor/test_cuda_repro.py::CudaReproTests::test_inplace_buffer_autotune, test/inductor/test_cuda_repro.py::CudaReproTests::test_inplace_updates_cudagraphs, test/inductor/test_cuda_repro.py::CudaReproTests::test_input_channels_last, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue100806, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue103461, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue103481, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue104759, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue97695_1input, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue97695_2input, test/inductor/test_cuda_repro.py::CudaReproTests::test_issue_103924, test/inductor/test_cuda_repro.py::CudaReproTests::test_linear_cpu_input, test/inductor/test_cuda_repro.py::CudaReproTests::test_linear_with_zero_infeature_size, test/inductor/test_cuda_repro.py::CudaReproTests::test_lookup_seed_backward, test/inductor/test_cuda_repro.py::CudaReproTests::test_memory_history_inductor, test/inductor/test_cuda_repro.py::CudaReproTests::test_multi_output_layout_fallback, test/inductor/test_cuda_repro.py::CudaReproTests::test_negative_arange_dynamic_shapes, test/inductor/test_cuda_repro.py::CudaReproTests::test_no_device_idx_repro_cudagraphs, test/inductor/test_cuda_repro.py::CudaReproTests::test_permute_fusion, test/inductor/test_cuda_repro.py::CudaReproTests::test_scalar_triton_index, test/inductor/test_cuda_repro.py::CudaReproTests::test_selecsls42b_misaligned_address, test/inductor/test_cuda_repro.py::CudaReproTests::test_simplify_dims, test/inductor/test_cuda_repro.py::CudaReproTests::test_sort_stride_issue, test/inductor/test_cuda_repro.py::CudaReproTests::test_unspec_inputs_interop, test/inductor/test_cuda_repro.py::CudaReproTests::test_xlnet_lm_stride_repro 2024-02-01T00:40:39.9999537Z 2024-02-01T00:40:39.9999855Z Running test_utils 1/1 ... [2024-02-01 00:40:39.996530] 2024-02-01T00:40:40.0001445Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:40:39.996752] 2024-02-01T00:41:11.7170248Z 2024-02-01T00:41:11.7171466Z test_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_utils_1.1_e1b414475baab95c_.log 2024-02-01T00:41:11.9769511Z Running 5733 items in this shard: test/test_utils.py::TestCheckpoint::test_checkpoint, test/test_utils.py::TestCheckpoint::test_checkpoint_module_list, test/test_utils.py::TestCheckpoint::test_checkpoint_no_tensors, test/test_utils.py::TestCheckpoint::test_checkpoint_non_tensor, test/test_utils.py::TestCheckpoint::test_checkpoint_non_tensor_inputs_outputs, test/test_utils.py::TestCheckpoint::test_checkpoint_not_preserve_rng_state_and_without_reentrant, test/test_utils.py::TestCheckpoint::test_checkpoint_partial_grad, test/test_utils.py::TestCheckpoint::test_checkpoint_rng_cpu, test/test_utils.py::TestCheckpoint::test_checkpoint_rng_cuda, test/test_utils.py::TestCheckpoint::test_checkpoint_sequential_deprecated_multiple_args, test/test_utils.py::TestCheckpoint::test_checkpoint_sequential_deprecated_no_args, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_valid, test/test_utils.py::TestCheckpoint::test_checkpointing_without_reentrant_early_free, test/test_utils.py::TestDataLoaderUtils::test_multi_drop, test/test_utils.py::TestDataLoaderUtils::test_multi_keep, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_single_drop, test/test_utils.py::TestDataLoaderUtils::test_single_keep, test/test_utils.py::TestBottleneck::test_bottleneck_cpu_only, test/test_utils.py::TestBottleneck::test_bottleneck_cuda, test/test_utils.py::TestCollectEnv::test_smoke, test/test_utils.py::TestONNXUtils::test_check_onnx_broadcast, test/test_utils.py::TestONNXUtils::test_prepare_onnx_paddings, test/test_utils.py::TestHipify::test_import_hipify, test/test_utils.py::TestAssert::test_assert_scriptable, test/test_utils.py::TestAssert::test_assert_true, test/test_utils.py::TestStandaloneCPPJIT::test_load_standalone, test/test_utils.py::TestExtensionUtils::test_external_module_and_backend_register, test/test_utils.py::TestExtensionUtils::test_external_module_register, test/test_utils.py::TestRenderUtils::test_basic, test/test_utils.py::TestDeviceUtilsCUDA::test_basic_cuda, test/test_utils.py::TestDeviceUtilsCUDA::test_decorator_cuda, test/test_utils.py::TestDeviceUtilsCUDA::test_decorator_generator_cuda, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_H_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_T_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___getitem___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___radd___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rand___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rand___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rand___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rand___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rand___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rand___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rdiv___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmatmul___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmatmul___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmatmul___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmatmul___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmatmul___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmatmul___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmod___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rmul___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___ror___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___ror___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___ror___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___ror___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___ror___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___ror___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rpow___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rsub___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rxor___cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rxor___cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rxor___cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rxor___cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rxor___cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops___rxor___cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__native_batch_norm_legit_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__native_batch_norm_legit_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__native_batch_norm_legit_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__native_batch_norm_legit_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_lengths_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_lengths_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_lengths_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_lengths_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_offsets_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_offsets_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_offsets_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__segment_reduce_offsets_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__softmax_backward_data_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__softmax_backward_data_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__softmax_backward_data_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__softmax_backward_data_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__upsample_bilinear2d_aa_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__upsample_bilinear2d_aa_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__upsample_bilinear2d_aa_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops__upsample_bilinear2d_aa_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_abs_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acos_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_acosh_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_add_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addbmm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addbmm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addbmm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addbmm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addbmm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addbmm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcdiv_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcdiv_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcdiv_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcdiv_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcdiv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcdiv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addcmul_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_decomposed_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_decomposed_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_decomposed_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_decomposed_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_decomposed_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmm_decomposed_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmv_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmv_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmv_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmv_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addmv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_addr_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_all_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_allclose_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_allclose_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_allclose_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_allclose_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_allclose_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_allclose_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_amin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_aminmax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_angle_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_any_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_arange_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argmin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argsort_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_argwhere_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_partial_views_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_as_strided_scatter_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_asinh_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atan_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atanh_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_1d_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_2d_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_atleast_3d_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_baddbmm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_baddbmm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_baddbmm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_baddbmm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_baddbmm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_baddbmm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bernoulli_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bernoulli_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bernoulli_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bernoulli_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bfloat16_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bincount_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bincount_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bincount_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bincount_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bincount_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_and_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_and_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_and_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_and_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_and_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_and_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_left_shift_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_left_shift_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_left_shift_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_left_shift_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_left_shift_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_not_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_not_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_not_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_not_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_not_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_not_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_or_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_or_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_or_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_or_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_or_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_or_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_right_shift_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_right_shift_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_right_shift_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_right_shift_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_right_shift_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_xor_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_xor_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_xor_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_xor_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_xor_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bitwise_xor_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_block_diag_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bmm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bmm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bmm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bmm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bmm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bmm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bool_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_shapes_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_tensors_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_broadcast_to_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_bucketize_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_byte_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cartesian_prod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cat_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cauchy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cauchy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cauchy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cauchy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdist_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdist_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cdouble_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ceil_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cfloat_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chalf_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_char_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_inverse_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_inverse_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_inverse_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_inverse_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_solve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_solve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_solve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cholesky_solve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_chunk_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_max_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clamp_min_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_clone_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_column_stack_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_combinations_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_complex_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_complex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_complex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_conj_physical_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_constant_pad_nd_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_contiguous_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_copysign_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_corrcoef_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cos_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cosh_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_count_nonzero_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cov_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cross_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cummin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumprod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumsum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_cumulative_trapezoid_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_deg2rad_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diag_embed_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagflat_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_copy_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diagonal_scatter_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_diff_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_digamma_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dist_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dist_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dist_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dist_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dist_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dist_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_floor_rounding_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_no_rounding_mode_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_div_trunc_rounding_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dot_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dot_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dot_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dot_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dot_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dot_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_double_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dsplit_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_dstack_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_einsum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_einsum_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_einsum_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_einsum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_einsum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_einsum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_like_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_permuted_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_empty_strided_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eq_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_equal_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erf_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfc_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_erfinv_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exp_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_as_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expand_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_expm1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exponential_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exponential_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exponential_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_exponential_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_eye_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fft_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_fftshift_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfft_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_hfftn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifft_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ifftshift_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfft_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_ihfftn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfft_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_irfftn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfft_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fft_rfftn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fill_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flatten_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flip_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fliplr_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_flipud_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_float_power_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_floor_divide_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_fmod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frac_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frac_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frac_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frac_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_frexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_full_like_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gather_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gcd_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gcd_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gcd_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gcd_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gcd_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ge_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geometric_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geqrf_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geqrf_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geqrf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_geqrf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gradient_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_grid_sampler_2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_grid_sampler_2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_grid_sampler_2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_grid_sampler_2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_gt_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_half_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_heaviside_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_histc_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_histc_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_histc_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_histc_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_histc_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_histc_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hsplit_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hstack_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hypot_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hypot_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hypot_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_hypot_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_i0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_igamma_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_igamma_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_igammac_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_igammac_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_imag_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_imag_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_imag_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_add_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_copy_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_fill_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_put_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_reduce_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_index_select_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_inner_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_inner_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_inner_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_inner_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_inner_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_inner_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_int_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isclose_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isfinite_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isinf_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isnan_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isneginf_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isposinf_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_isreal_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_istft_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_istft_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_item_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_2inputs_2outputs_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_4inputs_with_extra_args_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_binary_return_by_ref_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_jiterator_unary_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kron_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_kthvalue_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lcm_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lcm_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lcm_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lcm_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lcm_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ldexp_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_le_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lerp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lgamma_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_ex_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_ex_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_ex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cholesky_ex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cond_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cond_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cond_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cond_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_cross_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_singular_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_singular_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_singular_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_det_singular_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_diagonal_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eig_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eig_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eig_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eig_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvals_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvals_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvals_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvals_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvalsh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvalsh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvalsh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_eigvalsh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_householder_product_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_householder_product_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_householder_product_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_householder_product_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_ex_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_ex_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_ex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_inv_ex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_ex_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_ex_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_ex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_factor_ex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_solve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_solve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_solve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_ldl_solve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_grad_oriented_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_grad_oriented_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_grad_oriented_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lstsq_grad_oriented_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_ex_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_ex_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_ex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_factor_ex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_solve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_solve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_solve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_lu_solve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_norm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_norm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_power_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_power_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_power_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_power_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_hermitian_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_hermitian_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_hermitian_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_matrix_rank_hermitian_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_multi_dot_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_multi_dot_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_multi_dot_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_multi_dot_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_multi_dot_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_multi_dot_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_subgradients_at_zero_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_subgradients_at_zero_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_subgradients_at_zero_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_subgradients_at_zero_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_subgradients_at_zero_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_norm_subgradients_at_zero_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_hermitian_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_hermitian_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_hermitian_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_hermitian_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_singular_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_singular_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_singular_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_pinv_singular_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_qr_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_qr_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_qr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_qr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_slogdet_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_slogdet_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_slogdet_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_slogdet_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_ex_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_ex_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_ex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_ex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_triangular_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_triangular_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_triangular_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_solve_triangular_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svd_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svd_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svd_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svd_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svdvals_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svdvals_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svdvals_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_svdvals_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorinv_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorinv_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorinv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorinv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorsolve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorsolve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorsolve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_tensorsolve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vander_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vecdot_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vecdot_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vecdot_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vecdot_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vecdot_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vecdot_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vector_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vector_norm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vector_norm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vector_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vector_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linalg_vector_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_linspace_tensor_overload_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log10_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log1p_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_normal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_normal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_normal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_normal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_log_softmax_with_dtype_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp2_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logaddexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logcumsumexp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logcumsumexp_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logcumsumexp_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logcumsumexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logcumsumexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logcumsumexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logdet_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logdet_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logdet_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logdet_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_and_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_not_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_or_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logical_xor_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logit_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logspace_tensor_overload_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_logsumexp_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_long_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lt_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_solve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_solve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_solve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_solve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_unpack_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_unpack_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_unpack_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_lu_unpack_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mH_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mT_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_amin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_argmin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumprod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_cumsum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_fill_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_log_softmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_log_softmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_log_softmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_log_softmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logaddexp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logaddexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logaddexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logaddexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_logsumexp_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_mean_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_median_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_median_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_median_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_median_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_normalize_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_normalize_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_normalize_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_normalize_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_normalize_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_normalize_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_prod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_scatter_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_select_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_softmin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_std_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_sum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_masked_var_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matmul_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matmul_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matmul_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matmul_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matmul_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matmul_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matrix_exp_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matrix_exp_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matrix_exp_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matrix_exp_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matrix_exp_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_matrix_exp_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_binary_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_pool2d_with_indices_backward_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_pool2d_with_indices_backward_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_pool2d_with_indices_backward_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_pool2d_with_indices_backward_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_no_dim_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_max_reduction_with_dim_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_maximum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mean_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mean_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_median_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_list_of_tensors_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_meshgrid_variadic_tensors_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_binary_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_no_dim_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_min_reduction_with_dim_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_minimum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mode_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_movedim_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_msort_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mul_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_multinomial_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_multinomial_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_multinomial_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_multinomial_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mv_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mv_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mv_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mv_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mv_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mv_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_3_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_mvlgamma_mvlgamma_p_5_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nan_to_num_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanmedian_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanquantile_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nanquantile_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nansum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_copy_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_narrow_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_batch_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_batch_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_batch_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_batch_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_dropout_backward_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_dropout_backward_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_dropout_backward_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_dropout_backward_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_layer_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_layer_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_layer_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_native_layer_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ne_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_neg_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_empty_strided_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_full_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_ones_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_new_zeros_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nextafter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nextafter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nextafter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_avg_pool3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_adaptive_max_pool3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_alpha_dropout_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_alpha_dropout_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_alpha_dropout_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_alpha_dropout_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_avg_pool3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_without_cudnn_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_without_cudnn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_without_cudnn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_batch_norm_without_cudnn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_bilinear_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_bilinear_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_bilinear_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_bilinear_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_with_logits_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_with_logits_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_celu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_celu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_celu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_celu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_conv_transpose3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_embedding_loss_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_similarity_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_similarity_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_similarity_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cosine_similarity_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cross_entropy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cross_entropy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cross_entropy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_cross_entropy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_ctc_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_ctc_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_dropout_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_elu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_elu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_elu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_elu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_bag_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_bag_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_bag_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_bag_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_embedding_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_with_train_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_with_train_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_with_train_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_feature_alpha_dropout_without_train_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_fractional_max_pool3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gaussian_nll_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gaussian_nll_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gaussian_nll_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gaussian_nll_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gelu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gelu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gelu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_gelu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_glu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_glu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_glu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_glu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_grid_sample_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_grid_sample_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_grid_sample_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_grid_sample_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_group_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_group_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_group_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_group_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardshrink_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardshrink_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardshrink_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardshrink_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardsigmoid_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardsigmoid_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardsigmoid_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardsigmoid_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardswish_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardswish_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardswish_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardswish_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hardtanh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hinge_embedding_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hinge_embedding_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hinge_embedding_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_hinge_embedding_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_huber_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_huber_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_huber_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_huber_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_instance_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_instance_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_instance_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_instance_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_area_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_area_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_area_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_area_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bicubic_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bicubic_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bicubic_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bicubic_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bilinear_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bilinear_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bilinear_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_bilinear_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_linear_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_linear_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_linear_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_linear_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest-exact_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest-exact_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest-exact_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest-exact_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest-exact_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_nearest_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_trilinear_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_trilinear_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_trilinear_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_interpolate_trilinear_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_kl_div_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_kl_div_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_kl_div_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_kl_div_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_l1_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_l1_loss_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_l1_loss_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_l1_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_l1_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_l1_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_layer_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_layer_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_layer_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_layer_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_leaky_relu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_leaky_relu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_leaky_relu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_leaky_relu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_linear_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_linear_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_linear_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_linear_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_linear_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_linear_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_local_response_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_local_response_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_local_response_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_local_response_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_logsigmoid_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_logsigmoid_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_logsigmoid_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_logsigmoid_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_margin_ranking_loss_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool1d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool2d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool3d_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_pool3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool1d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool1d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool1d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool1d_grad_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool1d_grad_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool1d_grad_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool2d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool2d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool2d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool2d_grad_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool2d_grad_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool2d_grad_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool3d_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool3d_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool3d_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool3d_grad_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool3d_grad_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_max_unpool3d_grad_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mish_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mish_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mish_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mish_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mse_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mse_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mse_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_mse_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_head_attention_forward_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_head_attention_forward_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_head_attention_forward_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_head_attention_forward_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_margin_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_margin_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_margin_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multi_margin_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_margin_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_margin_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_margin_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_margin_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_soft_margin_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_soft_margin_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_soft_margin_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_multilabel_soft_margin_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_nll_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_nll_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_nll_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_nll_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_normalize_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_normalize_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_normalize_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_normalize_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_normalize_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_normalize_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_one_hot_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_circular_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_constant_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_reflect_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pad_replicate_negative_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pairwise_distance_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pdist_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pdist_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_shuffle_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_pixel_unshuffle_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_poisson_nll_loss_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_prelu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_prelu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_prelu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_prelu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu6_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_relu_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_rrelu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_rrelu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_rrelu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_rrelu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_scaled_dot_product_attention_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_scaled_dot_product_attention_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_scaled_dot_product_attention_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_scaled_dot_product_attention_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_selu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_selu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_selu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_selu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_silu_complex_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_silu_complex_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_silu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_silu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_silu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_silu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_smooth_l1_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_smooth_l1_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_smooth_l1_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_smooth_l1_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_soft_margin_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_soft_margin_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_soft_margin_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_soft_margin_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softmin_with_dtype_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softplus_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softplus_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softplus_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softplus_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softshrink_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softshrink_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softshrink_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softshrink_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_softsign_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_tanhshrink_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_threshold_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_loss_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_triplet_margin_with_distance_loss_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_unfold_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_unfold_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_unfold_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_unfold_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_unfold_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_unfold_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_bilinear_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_bilinear_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_bilinear_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_bilinear_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_nearest_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_nearest_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_nearest_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_nearest_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nn_functional_upsample_nearest_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_nonzero_static_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_fro_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_fro_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_fro_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_fro_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_fro_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_fro_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_inf_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_inf_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_inf_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_inf_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_inf_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_inf_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_nuc_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_nuc_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_nuc_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_norm_nuc_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_in_place_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_in_place_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_in_place_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_in_place_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_in_place_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_in_place_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_number_mean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_number_mean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_number_mean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_normal_number_mean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ones_like_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ormqr_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ormqr_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ormqr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ormqr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_outer_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pca_lowrank_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pca_lowrank_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_permute_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pinverse_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pinverse_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pinverse_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pinverse_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polar_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polar_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_2_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_3_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_polygamma_polygamma_n_4_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_positive_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_pow_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_prod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_put_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_qr_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_qr_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_qr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_qr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_quantile_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_quantile_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rad2deg_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rand_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randint_like_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_randn_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_ravel_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_real_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reciprocal_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_remainder_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_renorm_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_renorm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_renorm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_renorm_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_renorm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_renorm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_repeat_interleave_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_as_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_reshape_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize__cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resize_as__cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_conj_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_resolve_neg_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_roll_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rot90_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_0_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_0_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_3_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_3_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_3_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_3_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_neg_3_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_neg_3_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_neg_3_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_round_decimals_neg_3_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsqrt_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_rsub_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scalar_tensor_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_add_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amax_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_amin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_mean_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_prod_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_scatter_reduce_sum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_searchsorted_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_select_scatter_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sgn_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_short_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sigmoid_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sign_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_bartlett_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_bartlett_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_blackman_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_blackman_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_cosine_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_cosine_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_exponential_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_exponential_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_gaussian_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_gaussian_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_general_cosine_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_general_cosine_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_general_hamming_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_general_hamming_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_hamming_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_hamming_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_hann_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_hann_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_kaiser_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_kaiser_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_nuttall_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signal_windows_nuttall_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_signbit_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sin_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinc_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sinh_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_slice_scatter_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_softmax_with_dtype_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sort_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_mm_reduce_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_mm_reduce_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_mm_reduce_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_sampled_addmm_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_sampled_addmm_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_sampled_addmm_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sparse_sampled_addmm_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_airy_ai_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_j1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_bessel_y1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_t_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_u_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_v_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_chebyshev_polynomial_w_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_entr_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_erfcx_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_h_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_hermite_polynomial_he_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i0e_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_i1e_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_laguerre_polynomial_l_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_legendre_polynomial_p_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_log_ndtr_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_i1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_modified_bessel_k1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtr_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_ndtri_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_polygamma_special_polygamma_n_0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_scaled_modified_bessel_k1_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_t_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_u_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_v_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_shifted_chebyshev_polynomial_w_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_spherical_bessel_j0_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_xlog1py_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_special_zeta_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_list_args_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_split_with_sizes_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sqrt_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_square_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_squeeze_multiple_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stack_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_unbiased_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_unbiased_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_unbiased_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_unbiased_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_unbiased_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_mean_unbiased_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_unbiased_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_unbiased_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_unbiased_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_unbiased_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_unbiased_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_std_unbiased_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stft_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stft_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stft_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_stft_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sub_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_sum_to_size_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_svd_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_svd_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_svd_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_svd_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_svd_lowrank_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_svd_lowrank_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_t_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_along_dim_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_take_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tan_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tanh_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensor_split_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensordot_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensordot_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensordot_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensordot_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensordot_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tensordot_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tile_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_to_sparse_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_topk_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_torch__scaled_mm_cuda_float8_e4m3fn, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_torch_ops_aten__efficient_attention_forward_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_torch_ops_aten__efficient_attention_forward_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_torch_ops_aten__efficient_attention_forward_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_torch_ops_aten__flash_attention_forward_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_torch_ops_aten__flash_attention_forward_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trace_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_transpose_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapezoid_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trapz_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triangular_solve_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triangular_solve_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triangular_solve_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triangular_solve_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_indices_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_tril_indices_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_indices_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_triu_indices_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_true_divide_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_trunc_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unbind_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unflatten_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_copy_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unfold_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_uniform_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_uniform_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_uniform_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_uniform_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_uniform_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_uniform_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_consecutive_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unique_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unravel_index_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unravel_index_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unravel_index_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unravel_index_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unravel_index_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_chunk_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsafe_split_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_unsqueeze_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_unbiased_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_unbiased_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_unbiased_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_unbiased_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_unbiased_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_mean_unbiased_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_unbiased_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_unbiased_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_unbiased_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_unbiased_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_unbiased_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_var_unbiased_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vdot_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vdot_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vdot_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vdot_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vdot_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vdot_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_complex_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_complex_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_complex_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_real_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_as_real_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_copy_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_view_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vsplit_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_vstack_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_where_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_xlogy_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zero__cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_bfloat16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_bool, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_complex128, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_complex32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_complex64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_float16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_float32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_float64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_int16, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_int32, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_int64, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_int8, test/test_utils.py::TestDeviceUtilsCUDA::test_device_mode_ops_zeros_like_cuda_uint8, test/test_utils.py::TestDeviceUtilsCUDA::test_get_default_device_cuda, test/test_utils.py::TestDeviceUtilsCUDA::test_get_default_device_more_cuda, test/test_utils.py::TestDeviceUtilsCUDA::test_nn_module_cuda, test/test_utils.py::TestDeviceUtilsCUDA::test_set_default_device_cuda, test/test_utils.py::TestCppExtensionUtils::test_cc_compiler_is_ok, test/test_utils.py::TestCppExtensionUtils::test_cpp_compiler_is_ok, test/test_utils.py::TestTraceback::test_basic, test/test_utils.py::TestTraceback::test_captured_traceback, test/test_utils.py::TestTraceback::test_captured_traceback_format_all, test/test_utils.py::TestTraceback::test_captured_traceback_format_all_cached, test/test_utils.py::TestTraceback::test_format_traceback_short 2024-02-01T00:41:12.1995010Z 2024-02-01T00:41:12.1995473Z Running inductor/test_max_autotune 1/1 ... [2024-02-01 00:41:11.726806] 2024-02-01T00:41:12.1997227Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_max_autotune.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:41:11.727050] 2024-02-01T00:44:49.9897359Z 2024-02-01T00:44:49.9901573Z inductor/test_max_autotune 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_max_autotune_1.1_9abe6235ce9a3f04_.log 2024-02-01T00:44:49.9922399Z Running 38 items in this shard: test/inductor/test_max_autotune.py::TestMaxAutotune::test_autotune_conv1x1, test/inductor/test_max_autotune.py::TestMaxAutotune::test_benchmark_choice_fail_in_subproc, test/inductor/test_max_autotune.py::TestMaxAutotune::test_benchmark_choice_in_subproc, test/inductor/test_max_autotune.py::TestMaxAutotune::test_cat_addmm, test/inductor/test_max_autotune.py::TestMaxAutotune::test_matmul_dropout_device_cpu, test/inductor/test_max_autotune.py::TestMaxAutotune::test_matmul_dropout_device_cuda, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_addmm_dynamic_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_addmm_dynamic_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_addmm_zero_size_input_dynamic_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_addmm_zero_size_input_dynamic_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_addmm_dynamic_False_max_autotune_gemm_backends_ATen,Triton,CUTLASS, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_addmm_dynamic_False_max_autotune_gemm_backends_CUTLASS, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_chained_fusion_fp16, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_chained_fusion_fp16_fp32acc, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_mm_bias_dynamic_False_max_autotune_gemm_backends_ATen,Triton,CUTLASS, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_mm_bias_dynamic_False_max_autotune_gemm_backends_CUTLASS, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_no_fusion_dtype_mismatch, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_regular_mm_dynamic_False_max_autotune_gemm_backends_ATen,Triton,CUTLASS, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_regular_mm_dynamic_False_max_autotune_gemm_backends_CUTLASS, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_relu6_fusion_fp16_fp32acc, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_relu_fusion_fp16, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_relu_fusion_fp16_fp32acc, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_shape_dependent_normalization_fusion, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_simple_fusion_fp16, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_simple_fusion_fp16_fp32acc, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_mm_plus_mm_autotune_in_subproc_False_autotune_multi_device_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_mm_plus_mm_autotune_in_subproc_False_autotune_multi_device_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_mm_plus_mm_autotune_in_subproc_True_autotune_multi_device_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_mm_plus_mm_autotune_in_subproc_True_autotune_multi_device_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_mm_plus_mm_zero_size_input_dynamic_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_mm_plus_mm_zero_size_input_dynamic_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_dynamic_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_dynamic_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_zero_size_input_dynamic_False, test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_zero_size_input_dynamic_True, test/inductor/test_max_autotune.py::TestMaxAutotune::test_triton_template_with_epilogues_and_dynamic_shape, test/inductor/test_max_autotune.py::TestTuningProcess::test_tuning_pool_crash, test/inductor/test_max_autotune.py::TestTuningProcess::test_tuning_pool_multiple_devices 2024-02-01T00:44:49.9942119Z 2024-02-01T00:44:49.9942495Z Running inductor/test_cpu_repro 1/1 ... [2024-02-01 00:44:49.989851] 2024-02-01T00:44:49.9944224Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpu_repro.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:44:49.990082] 2024-02-01T00:51:26.2029648Z 2024-02-01T00:51:26.2031370Z inductor/test_cpu_repro 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpu_repro_1.1_c9a55710b62dae71_.log 2024-02-01T00:51:26.2077431Z Running 119 items in this shard: test/inductor/test_cpu_repro.py::CPUReproTests::test_ModularIndexing_range_issue_103133, test/inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d, test/inductor/test_cpu_repro.py::CPUReproTests::test_acosh_with_negative_large_input, test/inductor/test_cpu_repro.py::CPUReproTests::test_aten_normal_dtype, test/inductor/test_cpu_repro.py::CPUReproTests::test_atomic_add_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_attention_size_mismatch, test/inductor/test_cpu_repro.py::CPUReproTests::test_auto_simd, test/inductor/test_cpu_repro.py::CPUReproTests::test_bf16_zeros, test/inductor/test_cpu_repro.py::CPUReproTests::test_broadcast_mul_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_cat_mul, test/inductor/test_cpu_repro.py::CPUReproTests::test_channel_shuffle_cl_output, test/inductor/test_cpu_repro.py::CPUReproTests::test_complex_memory_overlap, test/inductor/test_cpu_repro.py::CPUReproTests::test_concat_inner_vec, test/inductor/test_cpu_repro.py::CPUReproTests::test_constant_store, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv2d_autocast, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv2d_bn_mixed_dtype, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv2d_packed, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv_stride_constraints, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv_transpose2d_has_output_size_input, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv_transpose2d_packed_cpu, test/inductor/test_cpu_repro.py::CPUReproTests::test_conv_used_from_multiple_places, test/inductor/test_cpu_repro.py::CPUReproTests::test_cpp_kernel_profile, test/inductor/test_cpu_repro.py::CPUReproTests::test_cpp_vec_constant_checker, test/inductor/test_cpu_repro.py::CPUReproTests::test_cpp_vec_index_expr_checker, test/inductor/test_cpu_repro.py::CPUReproTests::test_cpu_vec_cosim, test/inductor/test_cpu_repro.py::CPUReproTests::test_decomposed_dequant_relu_quant, test/inductor/test_cpu_repro.py::CPUReproTests::test_dequant_maxpool2d_lowering, test/inductor/test_cpu_repro.py::CPUReproTests::test_dequant_quant_lowering, test/inductor/test_cpu_repro.py::CPUReproTests::test_dequant_relu_quant_dequant_relu_quant_lowering, test/inductor/test_cpu_repro.py::CPUReproTests::test_do_not_insert_to_dtype_for_memory_copy_only_kernel, test/inductor/test_cpu_repro.py::CPUReproTests::test_eliminate_meaningless_copy, test/inductor/test_cpu_repro.py::CPUReproTests::test_embedding_vec, test/inductor/test_cpu_repro.py::CPUReproTests::test_embedding_vec_bf16, test/inductor/test_cpu_repro.py::CPUReproTests::test_expr_vec_non_contiguous, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp32_load_with_to_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp8_cast_bfloat16_shape_15,3,13, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp8_cast_bfloat16_shape_4,2048,4096, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp8_cast_float16_shape_15,3,13, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp8_cast_float16_shape_4,2048,4096, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp8_cast_float32_shape_15,3,13, test/inductor/test_cpu_repro.py::CPUReproTests::test_fp8_cast_float32_shape_4,2048,4096, test/inductor/test_cpu_repro.py::CPUReproTests::test_group_norm_vec, test/inductor/test_cpu_repro.py::CPUReproTests::test_horizontal_fusion, test/inductor/test_cpu_repro.py::CPUReproTests::test_in_out_buffer, test/inductor/test_cpu_repro.py::CPUReproTests::test_index_propagation_issue_102065, test/inductor/test_cpu_repro.py::CPUReproTests::test_inplace_add_alpha, test/inductor/test_cpu_repro.py::CPUReproTests::test_inplace_squeeze_needed, test/inductor/test_cpu_repro.py::CPUReproTests::test_insert_to_dtype_count, test/inductor/test_cpu_repro.py::CPUReproTests::test_int_div, test/inductor/test_cpu_repro.py::CPUReproTests::test_int_div_vec, test/inductor/test_cpu_repro.py::CPUReproTests::test_invalid_dropout_args, test/inductor/test_cpu_repro.py::CPUReproTests::test_invalid_index_of_empty_tensor, test/inductor/test_cpu_repro.py::CPUReproTests::test_ir_node_str, test/inductor/test_cpu_repro.py::CPUReproTests::test_linear_buffer_reuse, test/inductor/test_cpu_repro.py::CPUReproTests::test_linear_packed, test/inductor/test_cpu_repro.py::CPUReproTests::test_linear_used_from_multiple_places, test/inductor/test_cpu_repro.py::CPUReproTests::test_linear_with_no_default_contiguous_input, test/inductor/test_cpu_repro.py::CPUReproTests::test_linear_with_reshape, test/inductor/test_cpu_repro.py::CPUReproTests::test_load_inf_bf16, test/inductor/test_cpu_repro.py::CPUReproTests::test_load_same_bool_tensor_twice, test/inductor/test_cpu_repro.py::CPUReproTests::test_logical_op_store_to_lowp_data_dtype, test/inductor/test_cpu_repro.py::CPUReproTests::test_lowp_fp_neg_abs, test/inductor/test_cpu_repro.py::CPUReproTests::test_lstm_packed, test/inductor/test_cpu_repro.py::CPUReproTests::test_lstm_packed_change_input_sizes_cpu, test/inductor/test_cpu_repro.py::CPUReproTests::test_masked_fill_softmax, test/inductor/test_cpu_repro.py::CPUReproTests::test_masked_fill_with_inf_or_nan_value, test/inductor/test_cpu_repro.py::CPUReproTests::test_max_reduction_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_maxpool2d_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_maxpool2d_with_pre_loop_collapse_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_memory_copy_with_fusion, test/inductor/test_cpu_repro.py::CPUReproTests::test_multihead_attention_cpu, test/inductor/test_cpu_repro.py::CPUReproTests::test_new_vec_op_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_no_op_squeeze, test/inductor/test_cpu_repro.py::CPUReproTests::test_non_contiguous_index_with_constant_stride, test/inductor/test_cpu_repro.py::CPUReproTests::test_non_contiguous_load_buf_quant, test/inductor/test_cpu_repro.py::CPUReproTests::test_non_contiguous_reduction_store, test/inductor/test_cpu_repro.py::CPUReproTests::test_pack_padded_sequence_lstm, test/inductor/test_cpu_repro.py::CPUReproTests::test_pad_with_nan_value, test/inductor/test_cpu_repro.py::CPUReproTests::test_parallel_num_threads, test/inductor/test_cpu_repro.py::CPUReproTests::test_pow_cos, test/inductor/test_cpu_repro.py::CPUReproTests::test_reduce_with_masked, test/inductor/test_cpu_repro.py::CPUReproTests::test_reduction_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_redundant_to_node_elimination_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_relu_with_inf_value, test/inductor/test_cpu_repro.py::CPUReproTests::test_repeat_interleave, test/inductor/test_cpu_repro.py::CPUReproTests::test_scalar_mul_bfloat16, test/inductor/test_cpu_repro.py::CPUReproTests::test_scalar_sign_with_min, test/inductor/test_cpu_repro.py::CPUReproTests::test_scatter_using_atomic_add, test/inductor/test_cpu_repro.py::CPUReproTests::test_select_tiliing_with_index_expr, test/inductor/test_cpu_repro.py::CPUReproTests::test_sigmoid_with_reduction, test/inductor/test_cpu_repro.py::CPUReproTests::test_sign_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_skip_cpp_codegen, test/inductor/test_cpu_repro.py::CPUReproTests::test_slice_scatter_default_end_value, test/inductor/test_cpu_repro.py::CPUReproTests::test_tile2d_load_decomposed_dequant_add_relu_quant, test/inductor/test_cpu_repro.py::CPUReproTests::test_tile2d_store_channel_shuffle_cl_quant_output, test/inductor/test_cpu_repro.py::CPUReproTests::test_timed_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_to_channels_last_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_to_dtype_bool_float, test/inductor/test_cpu_repro.py::CPUReproTests::test_to_dtype_float_bool, test/inductor/test_cpu_repro.py::CPUReproTests::test_to_uint8_rounding_method, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_copy, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_mxn_16_16_bf16_fp16, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_mxn_32_32_bf16_fp16, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_non_contiguous, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_sum2d_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_sum_outer, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_vertical_sum_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_transpose_with_norm, test/inductor/test_cpu_repro.py::CPUReproTests::test_uint8_add, test/inductor/test_cpu_repro.py::CPUReproTests::test_uint8_sub, test/inductor/test_cpu_repro.py::CPUReproTests::test_unsupported_conv_transpose, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_compare_op_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_contiguous_ModularIndexing, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_cpu_only_for_all_available_isa, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_dynamic_shapes, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_kernel_cpu_only, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_logical, test/inductor/test_cpu_repro.py::CPUReproTests::test_vec_transpose_lowp_fp, test/inductor/test_cpu_repro.py::CPUReproTests::test_vertical_sum_cpu_only 2024-02-01T00:51:26.2120526Z 2024-02-01T00:51:26.2120928Z Running inductor/test_fused_attention 1/1 ... [2024-02-01 00:51:26.203626] 2024-02-01T00:51:26.2122722Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_fused_attention.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:51:26.203814] 2024-02-01T00:53:39.7589083Z 2024-02-01T00:53:39.7590404Z inductor/test_fused_attention 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_fused_attention_1.1_d50cfb2f1b2deb0f_.log 2024-02-01T00:53:39.7629014Z Running 76 items in this shard: test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_pattern_fails_with_reuse_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_pattern_fails_with_tensor_factor_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_pattern_fails_with_unsupported_mask_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_prev_13_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_prev_14_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_prev_15_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_10_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_11_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_12_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_13_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_14_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_15_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_17_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_1_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_2_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_3_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_4_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_5_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_6_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_7_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_8_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaTests::test_sdpa_rewriter_9_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_pattern_fails_with_reuse_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_pattern_fails_with_tensor_factor_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_pattern_fails_with_unsupported_mask_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_prev_13_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_prev_14_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_prev_15_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_10_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_11_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_12_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_13_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_14_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_15_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_17_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_1_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_2_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_3_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_4_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_5_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_6_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_7_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_8_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCudaDynamicTests::test_sdpa_rewriter_9_cuda, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_pattern_fails_with_reuse_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_pattern_fails_with_tensor_factor_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_pattern_fails_with_unsupported_mask_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_prev_13_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_prev_14_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_prev_15_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_12_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_13_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_14_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_15_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_16_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_17_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_1_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_2_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_5_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_pattern_fails_with_reuse_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_pattern_fails_with_tensor_factor_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_pattern_fails_with_unsupported_mask_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_prev_13_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_prev_14_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_prev_15_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_11_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_12_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_13_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_14_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_15_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_16_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_17_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_1_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_2_cpu, test/inductor/test_fused_attention.py::SDPAPatternRewriterCpuDynamicTests::test_sdpa_rewriter_5_cpu 2024-02-01T00:53:39.7665506Z 2024-02-01T00:53:39.7665867Z Running dynamo/test_modules 1/1 ... [2024-02-01 00:53:39.758955] 2024-02-01T00:53:39.7667533Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_modules.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:53:39.759162] 2024-02-01T00:53:51.4434707Z 2024-02-01T00:53:51.4436026Z dynamo/test_modules 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_modules_1.1_e570c46d6bddb566_.log 2024-02-01T00:53:51.4469885Z Running 98 items in this shard: test/dynamo/test_modules.py::NNModuleTests::test_access_by_keys, test/dynamo/test_modules.py::NNModuleTests::test_basicmodule1, test/dynamo/test_modules.py::NNModuleTests::test_basicmodule2, test/dynamo/test_modules.py::NNModuleTests::test_call_fn_with_non_const_inputs_safe, test/dynamo/test_modules.py::NNModuleTests::test_cfgmod, test/dynamo/test_modules.py::NNModuleTests::test_children, test/dynamo/test_modules.py::NNModuleTests::test_constloop, test/dynamo/test_modules.py::NNModuleTests::test_conv_call_forward_directly, test/dynamo/test_modules.py::NNModuleTests::test_conv_call_super_forward_directly, test/dynamo/test_modules.py::NNModuleTests::test_conv_transpose_call_forward_directly, test/dynamo/test_modules.py::NNModuleTests::test_conv_transpose_call_super_forward_directly, test/dynamo/test_modules.py::NNModuleTests::test_densenet, test/dynamo/test_modules.py::NNModuleTests::test_enumvalues, test/dynamo/test_modules.py::NNModuleTests::test_fnmember, test/dynamo/test_modules.py::NNModuleTests::test_fnmembercmp1, test/dynamo/test_modules.py::NNModuleTests::test_fnmembercmp2, test/dynamo/test_modules.py::NNModuleTests::test_forward_directly, test/dynamo/test_modules.py::NNModuleTests::test_generation_tag, test/dynamo/test_modules.py::NNModuleTests::test_hasattr, test/dynamo/test_modules.py::NNModuleTests::test_intarg, test/dynamo/test_modules.py::NNModuleTests::test_iseval1, test/dynamo/test_modules.py::NNModuleTests::test_iseval2, test/dynamo/test_modules.py::NNModuleTests::test_isnonelayer, test/dynamo/test_modules.py::NNModuleTests::test_istraining1, test/dynamo/test_modules.py::NNModuleTests::test_istraining2, test/dynamo/test_modules.py::NNModuleTests::test_layerlist, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module1, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module2, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module3, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module4, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module5, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module6, test/dynamo/test_modules.py::NNModuleTests::test_lazy_module_no_cls_to_become, test/dynamo/test_modules.py::NNModuleTests::test_module_attribute_precedence, test/dynamo/test_modules.py::NNModuleTests::test_module_call_module_with_static_forward, test/dynamo/test_modules.py::NNModuleTests::test_module_class_method, test/dynamo/test_modules.py::NNModuleTests::test_module_comparison, test/dynamo/test_modules.py::NNModuleTests::test_module_forward_has_graph_break, test/dynamo/test_modules.py::NNModuleTests::test_module_guard_name_is_valid, test/dynamo/test_modules.py::NNModuleTests::test_module_name_string, test/dynamo/test_modules.py::NNModuleTests::test_module_property, test/dynamo/test_modules.py::NNModuleTests::test_module_static_method, test/dynamo/test_modules.py::NNModuleTests::test_moduledict, test/dynamo/test_modules.py::NNModuleTests::test_moduledict_custom, test/dynamo/test_modules.py::NNModuleTests::test_modulelist, test/dynamo/test_modules.py::NNModuleTests::test_modulelist_custom, test/dynamo/test_modules.py::NNModuleTests::test_modulelist_nested, test/dynamo/test_modules.py::NNModuleTests::test_modulemethod1, test/dynamo/test_modules.py::NNModuleTests::test_modulemethod2, test/dynamo/test_modules.py::NNModuleTests::test_named_children, test/dynamo/test_modules.py::NNModuleTests::test_nn_moduledict_contains, test/dynamo/test_modules.py::NNModuleTests::test_parameterdict, test/dynamo/test_modules.py::NNModuleTests::test_parameterdict_custom, test/dynamo/test_modules.py::NNModuleTests::test_parameters1, test/dynamo/test_modules.py::NNModuleTests::test_parameters2, test/dynamo/test_modules.py::NNModuleTests::test_parameters3, test/dynamo/test_modules.py::NNModuleTests::test_self_mutating1, test/dynamo/test_modules.py::NNModuleTests::test_seq, test/dynamo/test_modules.py::NNModuleTests::test_sequential_with_duplicated_module, test/dynamo/test_modules.py::NNModuleTests::test_sequential_with_duplicated_module2, test/dynamo/test_modules.py::NNModuleTests::test_simple_torch_function, test/dynamo/test_modules.py::NNModuleTests::test_stringmember, test/dynamo/test_modules.py::NNModuleTests::test_submodules1, test/dynamo/test_modules.py::NNModuleTests::test_submodules2, test/dynamo/test_modules.py::NNModuleTests::test_super1, test/dynamo/test_modules.py::NNModuleTests::test_super2, test/dynamo/test_modules.py::NNModuleTests::test_super_class_method, test/dynamo/test_modules.py::NNModuleTests::test_tensorlist, test/dynamo/test_modules.py::NNModuleTests::test_torch_function_with_closure, test/dynamo/test_modules.py::NNModuleTests::test_unsupportedmethod, test/dynamo/test_modules.py::NNModuleTests::test_unsupportedmodule, test/dynamo/test_modules.py::NNModuleTests::test_viamodulecall, test/dynamo/test_modules.py::OptimizedModuleTest::test_assign_does_not_exist, test/dynamo/test_modules.py::OptimizedModuleTest::test_attr, test/dynamo/test_modules.py::OptimizedModuleTest::test_backward_hooks, test/dynamo/test_modules.py::OptimizedModuleTest::test_cache_size_limit_on_guarded_nn_modules, test/dynamo/test_modules.py::OptimizedModuleTest::test_composition, test/dynamo/test_modules.py::OptimizedModuleTest::test_composition_with_opt_mod, test/dynamo/test_modules.py::OptimizedModuleTest::test_dir, test/dynamo/test_modules.py::OptimizedModuleTest::test_dunder_call_explicitly, test/dynamo/test_modules.py::OptimizedModuleTest::test_hooks_allowed_modules, test/dynamo/test_modules.py::OptimizedModuleTest::test_hooks_allowed_modules_compiles, test/dynamo/test_modules.py::OptimizedModuleTest::test_hooks_allowed_modules_compiles_self_contained, test/dynamo/test_modules.py::OptimizedModuleTest::test_hooks_inner, test/dynamo/test_modules.py::OptimizedModuleTest::test_hooks_outer, test/dynamo/test_modules.py::OptimizedModuleTest::test_hooks_skip_guards, test/dynamo/test_modules.py::OptimizedModuleTest::test_module_dict_iter_keys, test/dynamo/test_modules.py::OptimizedModuleTest::test_module_dict_iter_name, test/dynamo/test_modules.py::OptimizedModuleTest::test_module_dict_iter_values, test/dynamo/test_modules.py::OptimizedModuleTest::test_module_patch, test/dynamo/test_modules.py::OptimizedModuleTest::test_nn_module, test/dynamo/test_modules.py::OptimizedModuleTest::test_no_guard_on_torch_nn_modules, test/dynamo/test_modules.py::OptimizedModuleTest::test_no_op_assignment, test/dynamo/test_modules.py::OptimizedModuleTest::test_no_recompile_on_nn_guarded_modules, test/dynamo/test_modules.py::OptimizedModuleTest::test_recursion, test/dynamo/test_modules.py::OptimizedModuleTest::test_to, test/dynamo/test_modules.py::OptimizedModuleTest::test_unspec_non_inlinable_module, test/dynamo/test_modules.py::OptimizedModuleTest::test_unspecialized_seq 2024-02-01T00:53:51.4501965Z 2024-02-01T00:53:51.4502352Z Running dynamo/test_functions 1/1 ... [2024-02-01 00:53:51.443665] 2024-02-01T00:53:51.4504072Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_functions.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:53:51.443899] 2024-02-01T00:54:10.7933977Z 2024-02-01T00:54:10.7935502Z dynamo/test_functions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_functions_1.1_e8e2c654a4c91fd2_.log 2024-02-01T00:54:10.8024687Z Running 268 items in this shard: test/dynamo/test_functions.py::FunctionTests::test_T, test/dynamo/test_functions.py::FunctionTests::test_add, test/dynamo/test_functions.py::FunctionTests::test_add_, test/dynamo/test_functions.py::FunctionTests::test_addcdiv, test/dynamo/test_functions.py::FunctionTests::test_addcdiv_, test/dynamo/test_functions.py::FunctionTests::test_build_list_unpack, test/dynamo/test_functions.py::FunctionTests::test_call_dict1, test/dynamo/test_functions.py::FunctionTests::test_call_dict2, test/dynamo/test_functions.py::FunctionTests::test_call_dict3, test/dynamo/test_functions.py::FunctionTests::test_call_dict4, test/dynamo/test_functions.py::FunctionTests::test_call_dict5, test/dynamo/test_functions.py::FunctionTests::test_callable_builtin, test/dynamo/test_functions.py::FunctionTests::test_callable_lambda, test/dynamo/test_functions.py::FunctionTests::test_callable_torch, test/dynamo/test_functions.py::FunctionTests::test_chunks1, test/dynamo/test_functions.py::FunctionTests::test_compare_constant_and_tensor, test/dynamo/test_functions.py::FunctionTests::test_complex_closure, test/dynamo/test_functions.py::FunctionTests::test_const_tuple_add1, test/dynamo/test_functions.py::FunctionTests::test_const_tuple_add2, test/dynamo/test_functions.py::FunctionTests::test_constant1, test/dynamo/test_functions.py::FunctionTests::test_constant2, test/dynamo/test_functions.py::FunctionTests::test_constant3, test/dynamo/test_functions.py::FunctionTests::test_constant4, test/dynamo/test_functions.py::FunctionTests::test_context_wrapping_nested_functions_no_closure, test/dynamo/test_functions.py::FunctionTests::test_cublas_allow_tf32, test/dynamo/test_functions.py::FunctionTests::test_custom_dict_kwargs, test/dynamo/test_functions.py::FunctionTests::test_default_dict, test/dynamo/test_functions.py::FunctionTests::test_default_dict_closure, test/dynamo/test_functions.py::FunctionTests::test_default_dict_constr, test/dynamo/test_functions.py::FunctionTests::test_default_dict_lambda, test/dynamo/test_functions.py::FunctionTests::test_del, test/dynamo/test_functions.py::FunctionTests::test_deque, test/dynamo/test_functions.py::FunctionTests::test_device, test/dynamo/test_functions.py::FunctionTests::test_device_constant, test/dynamo/test_functions.py::FunctionTests::test_dict_copy, test/dynamo/test_functions.py::FunctionTests::test_dict_fromkeys, test/dynamo/test_functions.py::FunctionTests::test_dict_keys, test/dynamo/test_functions.py::FunctionTests::test_dict_kwargs, test/dynamo/test_functions.py::FunctionTests::test_dict_ops, test/dynamo/test_functions.py::FunctionTests::test_dict_param_keys, test/dynamo/test_functions.py::FunctionTests::test_dict_sorted, test/dynamo/test_functions.py::FunctionTests::test_dict_tuple_lazy_guard, test/dynamo/test_functions.py::FunctionTests::test_dict_update, test/dynamo/test_functions.py::FunctionTests::test_dict_values, test/dynamo/test_functions.py::FunctionTests::test_distributed_is_available, test/dynamo/test_functions.py::FunctionTests::test_distributed_is_initialized, test/dynamo/test_functions.py::FunctionTests::test_dtype, test/dynamo/test_functions.py::FunctionTests::test_dtype_compare, test/dynamo/test_functions.py::FunctionTests::test_finfo, test/dynamo/test_functions.py::FunctionTests::test_float, test/dynamo/test_functions.py::FunctionTests::test_fn_with_self_set, test/dynamo/test_functions.py::FunctionTests::test_fstrings1, test/dynamo/test_functions.py::FunctionTests::test_fstrings2, test/dynamo/test_functions.py::FunctionTests::test_fstrings3, test/dynamo/test_functions.py::FunctionTests::test_funcdef_closure, test/dynamo/test_functions.py::FunctionTests::test_functools_partial, test/dynamo/test_functions.py::FunctionTests::test_get_autocast_gpu_dtype, test/dynamo/test_functions.py::FunctionTests::test_get_calculate_correct_fan, test/dynamo/test_functions.py::FunctionTests::test_get_default_dtype, test/dynamo/test_functions.py::FunctionTests::test_get_privateuse1_name, test/dynamo/test_functions.py::FunctionTests::test_globalfn, test/dynamo/test_functions.py::FunctionTests::test_globalmodule, test/dynamo/test_functions.py::FunctionTests::test_globalvar, test/dynamo/test_functions.py::FunctionTests::test_import1, test/dynamo/test_functions.py::FunctionTests::test_indirect1, test/dynamo/test_functions.py::FunctionTests::test_indirect2, test/dynamo/test_functions.py::FunctionTests::test_indirect3, test/dynamo/test_functions.py::FunctionTests::test_inline_jit__unwrap_optional, test/dynamo/test_functions.py::FunctionTests::test_inline_jit_annotations, test/dynamo/test_functions.py::FunctionTests::test_inline_softmax, test/dynamo/test_functions.py::FunctionTests::test_inline_with_default, test/dynamo/test_functions.py::FunctionTests::test_inner_function, test/dynamo/test_functions.py::FunctionTests::test_is_complex, test/dynamo/test_functions.py::FunctionTests::test_is_contiguous_frame_counts, test/dynamo/test_functions.py::FunctionTests::test_is_contiguous_memory_format, test/dynamo/test_functions.py::FunctionTests::test_is_floating_point, test/dynamo/test_functions.py::FunctionTests::test_is_fx_tracing, test/dynamo/test_functions.py::FunctionTests::test_is_in_onnx_export, test/dynamo/test_functions.py::FunctionTests::test_is_not_null, test/dynamo/test_functions.py::FunctionTests::test_is_quantized, test/dynamo/test_functions.py::FunctionTests::test_is_sparse, test/dynamo/test_functions.py::FunctionTests::test_islice_chain, test/dynamo/test_functions.py::FunctionTests::test_itertools_chain, test/dynamo/test_functions.py::FunctionTests::test_itertools_chain_from_iterable, test/dynamo/test_functions.py::FunctionTests::test_itertools_combinations, test/dynamo/test_functions.py::FunctionTests::test_itertools_product, test/dynamo/test_functions.py::FunctionTests::test_jit_annotate, test/dynamo/test_functions.py::FunctionTests::test_len_constant_dict, test/dynamo/test_functions.py::FunctionTests::test_len_constant_list, test/dynamo/test_functions.py::FunctionTests::test_len_constant_misc_iterables, test/dynamo/test_functions.py::FunctionTests::test_len_tensor, test/dynamo/test_functions.py::FunctionTests::test_list_add, test/dynamo/test_functions.py::FunctionTests::test_list_clear, test/dynamo/test_functions.py::FunctionTests::test_list_convert, test/dynamo/test_functions.py::FunctionTests::test_list_index_with_constant_tensor, test/dynamo/test_functions.py::FunctionTests::test_list_reversed, test/dynamo/test_functions.py::FunctionTests::test_list_slice_assignment, test/dynamo/test_functions.py::FunctionTests::test_list_sorted1, test/dynamo/test_functions.py::FunctionTests::test_list_sorted2, test/dynamo/test_functions.py::FunctionTests::test_list_truth, test/dynamo/test_functions.py::FunctionTests::test_listarg1, test/dynamo/test_functions.py::FunctionTests::test_listarg2, test/dynamo/test_functions.py::FunctionTests::test_listarg3, test/dynamo/test_functions.py::FunctionTests::test_listarg4, test/dynamo/test_functions.py::FunctionTests::test_listarg5, test/dynamo/test_functions.py::FunctionTests::test_load_global_bool, test/dynamo/test_functions.py::FunctionTests::test_mT, test/dynamo/test_functions.py::FunctionTests::test_manual_seed, test/dynamo/test_functions.py::FunctionTests::test_map_sum, test/dynamo/test_functions.py::FunctionTests::test_math_radians, test/dynamo/test_functions.py::FunctionTests::test_mean_sum_np, test/dynamo/test_functions.py::FunctionTests::test_methodcall1, test/dynamo/test_functions.py::FunctionTests::test_methodcall2, test/dynamo/test_functions.py::FunctionTests::test_methodcall3, test/dynamo/test_functions.py::FunctionTests::test_min_max, test/dynamo/test_functions.py::FunctionTests::test_module_constant, test/dynamo/test_functions.py::FunctionTests::test_namedtuple, test/dynamo/test_functions.py::FunctionTests::test_namedtuple_defaults, test/dynamo/test_functions.py::FunctionTests::test_namedtuple_user_methods, test/dynamo/test_functions.py::FunctionTests::test_ndarray_builtin_functions, test/dynamo/test_functions.py::FunctionTests::test_ndarray_method, test/dynamo/test_functions.py::FunctionTests::test_ndarray_methods_returning_scalar, test/dynamo/test_functions.py::FunctionTests::test_ndarray_reshape, test/dynamo/test_functions.py::FunctionTests::test_ndarray_transpose, test/dynamo/test_functions.py::FunctionTests::test_ndim, test/dynamo/test_functions.py::FunctionTests::test_no_recompile_inner_function, test/dynamo/test_functions.py::FunctionTests::test_no_recompile_inner_lambda, test/dynamo/test_functions.py::FunctionTests::test_non_inlined_closure, test/dynamo/test_functions.py::FunctionTests::test_not_list, test/dynamo/test_functions.py::FunctionTests::test_numpy_attributes, test/dynamo/test_functions.py::FunctionTests::test_numpy_dtype_argument_to_function, test/dynamo/test_functions.py::FunctionTests::test_numpy_fft, test/dynamo/test_functions.py::FunctionTests::test_numpy_linalg, test/dynamo/test_functions.py::FunctionTests::test_numpy_meshgrid, test/dynamo/test_functions.py::FunctionTests::test_numpy_random, test/dynamo/test_functions.py::FunctionTests::test_numpy_size, test/dynamo/test_functions.py::FunctionTests::test_ordered_dict_kwargs, test/dynamo/test_functions.py::FunctionTests::test_partial_across_graph_break_uninvoked, test/dynamo/test_functions.py::FunctionTests::test_partials_as_input_UDF, test/dynamo/test_functions.py::FunctionTests::test_partials_as_input_partials_lambda, test/dynamo/test_functions.py::FunctionTests::test_partials_as_input_partials_mod, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___annotations__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___builtins__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___call__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___class__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___closure__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___code__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___defaults__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___delattr__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___dict__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___dir__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___doc__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___eq__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___format__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___ge__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___get__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___getattribute__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___globals__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___gt__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___hash__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___init__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___init_subclass__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___kwdefaults__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___le__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___lt__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___module__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___name__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___ne__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___new__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___qualname__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___reduce__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___reduce_ex__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___repr__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___setattr__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___sizeof__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___str__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr___subclasshook__, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr_args, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr_func, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_attr_keywords, test/dynamo/test_functions.py::FunctionTests::test_partials_hasattr_set_attr, test/dynamo/test_functions.py::FunctionTests::test_partials_lambda, test/dynamo/test_functions.py::FunctionTests::test_partials_recompilation, test/dynamo/test_functions.py::FunctionTests::test_partials_torch_op_arg, test/dynamo/test_functions.py::FunctionTests::test_partials_torch_op_kwarg, test/dynamo/test_functions.py::FunctionTests::test_partials_udf_arg, test/dynamo/test_functions.py::FunctionTests::test_partials_udf_kwarg, test/dynamo/test_functions.py::FunctionTests::test_partials_udf_kwarg_method, test/dynamo/test_functions.py::FunctionTests::test_partials_udf_kwarg_module, test/dynamo/test_functions.py::FunctionTests::test_pop, test/dynamo/test_functions.py::FunctionTests::test_pos, test/dynamo/test_functions.py::FunctionTests::test_pow_int, test/dynamo/test_functions.py::FunctionTests::test_promote_types, test/dynamo/test_functions.py::FunctionTests::test_range1, test/dynamo/test_functions.py::FunctionTests::test_range2, test/dynamo/test_functions.py::FunctionTests::test_reduce, test/dynamo/test_functions.py::FunctionTests::test_reduce_with_initial, test/dynamo/test_functions.py::FunctionTests::test_reduce_with_none_initial, test/dynamo/test_functions.py::FunctionTests::test_reduce_with_single, test/dynamo/test_functions.py::FunctionTests::test_reduce_with_single_with_initial, test/dynamo/test_functions.py::FunctionTests::test_return_dict, test/dynamo/test_functions.py::FunctionTests::test_return_dict2, test/dynamo/test_functions.py::FunctionTests::test_return_multiple_numpy_ndarray, test/dynamo/test_functions.py::FunctionTests::test_return_numpy_ndarray, test/dynamo/test_functions.py::FunctionTests::test_return_tuple1, test/dynamo/test_functions.py::FunctionTests::test_return_tuple2, test/dynamo/test_functions.py::FunctionTests::test_set_contains, test/dynamo/test_functions.py::FunctionTests::test_shape1, test/dynamo/test_functions.py::FunctionTests::test_shape2, test/dynamo/test_functions.py::FunctionTests::test_slice1, test/dynamo/test_functions.py::FunctionTests::test_slice2, test/dynamo/test_functions.py::FunctionTests::test_slice3, test/dynamo/test_functions.py::FunctionTests::test_slice4, test/dynamo/test_functions.py::FunctionTests::test_slice5, test/dynamo/test_functions.py::FunctionTests::test_slice6, test/dynamo/test_functions.py::FunctionTests::test_startswith, test/dynamo/test_functions.py::FunctionTests::test_sum, test/dynamo/test_functions.py::FunctionTests::test_sum_shortcut, test/dynamo/test_functions.py::FunctionTests::test_sum_shortcut_with_start_arg, test/dynamo/test_functions.py::FunctionTests::test_sum_shortcut_with_start_kwarg, test/dynamo/test_functions.py::FunctionTests::test_sum_with_start_arg, test/dynamo/test_functions.py::FunctionTests::test_sum_with_start_kwarg, test/dynamo/test_functions.py::FunctionTests::test_symbool_to_int, test/dynamo/test_functions.py::FunctionTests::test_tensor_element_size, test/dynamo/test_functions.py::FunctionTests::test_tensor_len, test/dynamo/test_functions.py::FunctionTests::test_tensor_new_with_shape, test/dynamo/test_functions.py::FunctionTests::test_tensor_new_with_size, test/dynamo/test_functions.py::FunctionTests::test_tensor_size_indexed_by_symint, test/dynamo/test_functions.py::FunctionTests::test_tensor_type, test/dynamo/test_functions.py::FunctionTests::test_tensor_type2, test/dynamo/test_functions.py::FunctionTests::test_tensor_type3, test/dynamo/test_functions.py::FunctionTests::test_tensor_type4, test/dynamo/test_functions.py::FunctionTests::test_tensor_type5, test/dynamo/test_functions.py::FunctionTests::test_torch_distributions_functions, test/dynamo/test_functions.py::FunctionTests::test_torch_from_numpy, test/dynamo/test_functions.py::FunctionTests::test_transpose_for_scores, test/dynamo/test_functions.py::FunctionTests::test_tuple1, test/dynamo/test_functions.py::FunctionTests::test_tuple2, test/dynamo/test_functions.py::FunctionTests::test_tuple_contains, test/dynamo/test_functions.py::FunctionTests::test_tuple_iadd, test/dynamo/test_functions.py::FunctionTests::test_tuple_sorted, test/dynamo/test_functions.py::FunctionTests::test_unary_fold_op, test/dynamo/test_functions.py::FunctionTests::test_unpack1, test/dynamo/test_functions.py::FunctionTests::test_unpack2, test/dynamo/test_functions.py::FunctionTests::test_unpack3, test/dynamo/test_functions.py::FunctionTests::test_unpack_ex1, test/dynamo/test_functions.py::FunctionTests::test_unpack_ex2, test/dynamo/test_functions.py::FunctionTests::test_unpack_ex3, test/dynamo/test_functions.py::FunctionTests::test_viamethod, test/dynamo/test_functions.py::FunctionTests::test_viatorch, test/dynamo/test_functions.py::DefaultsTests::test_cast_tensor_single_elem, test/dynamo/test_functions.py::DefaultsTests::test_dataclass_factory, test/dynamo/test_functions.py::DefaultsTests::test_dataclass_nested, test/dynamo/test_functions.py::DefaultsTests::test_func_default_tensor_args, test/dynamo/test_functions.py::DefaultsTests::test_func_default_torch_args, test/dynamo/test_functions.py::DefaultsTests::test_in_set_inplace, test/dynamo/test_functions.py::DefaultsTests::test_in_set_would_fail_broadcast, test/dynamo/test_functions.py::DefaultsTests::test_is_init_in_compile_mutated_tensor_tensor, test/dynamo/test_functions.py::DefaultsTests::test_is_init_in_compile_vmapped_mutated_tensor_tensor, test/dynamo/test_functions.py::DefaultsTests::test_is_init_in_compile_vmapped_mutated_tensor_tensor_multi_arg, test/dynamo/test_functions.py::DefaultsTests::test_is_mutated_tensor_tensor, test/dynamo/test_functions.py::DefaultsTests::test_is_mutated_tensor_tensor_across_graph_break, test/dynamo/test_functions.py::DefaultsTests::test_is_tensor_tensor, test/dynamo/test_functions.py::DefaultsTests::test_is_vmapped_mutated_tensor_tensor, test/dynamo/test_functions.py::DefaultsTests::test_listlike_of_tensors_contains_constant, test/dynamo/test_functions.py::DefaultsTests::test_meth_default_tensor_args, test/dynamo/test_functions.py::DefaultsTests::test_set_construction, test/dynamo/test_functions.py::DefaultsTests::test_zip_strict 2024-02-01T00:54:10.8110954Z 2024-02-01T00:54:10.8111377Z Running dynamo/test_minifier 1/1 ... [2024-02-01 00:54:10.793726] 2024-02-01T00:54:10.8113146Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_minifier.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:54:10.793939] 2024-02-01T00:54:15.2665847Z 2024-02-01T00:54:15.2667361Z dynamo/test_minifier 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_minifier_1.1_a9472406c7688bd7_.log 2024-02-01T00:54:15.2675016Z Running 15 items in this shard: test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cpu_accuracy_backend_passes, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cpu_accuracy_error, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cpu_compile_backend_passes, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cpu_compile_error, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cpu_runtime_backend_passes, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cpu_runtime_error, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cuda_accuracy_backend_passes, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cuda_accuracy_error, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cuda_compile_backend_passes, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cuda_compile_error, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cuda_runtime_backend_passes, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_cuda_runtime_error, test/dynamo/test_minifier.py::MinifierTests::test_after_dynamo_non_leaf_compile_error, test/dynamo/test_minifier.py::MinifierTests::test_cpu_cuda_module_after_dynamo, test/dynamo/test_minifier.py::MinifierTests::test_if_graph_minified 2024-02-01T00:54:15.2681550Z 2024-02-01T00:54:15.2681954Z Running inductor/test_inductor_freezing 1/1 ... [2024-02-01 00:54:15.266666] 2024-02-01T00:54:15.2683817Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_inductor_freezing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:54:15.266903] 2024-02-01T00:55:19.9443439Z 2024-02-01T00:55:19.9445534Z inductor/test_inductor_freezing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_inductor_freezing_1.1_edd56ea19d17f41c_.log 2024-02-01T00:55:19.9461726Z Running 34 items in this shard: test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_aliased_param_return_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_autocast_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_conv_layout_convert_with_view_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_conv_multiple_uses_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_conv_weight_layout_convert_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_conv_with_as_strided_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_cpp_wrapper_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_dont_change_dtype_folding_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_error_on_eager_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_folded_conv_bn_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_mm_concat_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_mutation_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_param_deallocated_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_redundant_clone_for_layout_convert_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_rng_op_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_symint_not_folded_cpu, test/inductor/test_inductor_freezing.py::FreezingCpuTests::test_unfolded_bn_cpu, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_aliased_param_return_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_autocast_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_conv_layout_convert_with_view_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_conv_multiple_uses_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_conv_weight_layout_convert_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_conv_with_as_strided_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_cpp_wrapper_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_dont_change_dtype_folding_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_error_on_eager_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_folded_conv_bn_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_mm_concat_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_mutation_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_param_deallocated_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_redundant_clone_for_layout_convert_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_rng_op_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_symint_not_folded_cuda, test/inductor/test_inductor_freezing.py::FreezingCudaTests::test_unfolded_bn_cuda 2024-02-01T00:55:19.9475760Z 2024-02-01T00:55:19.9476193Z Running inductor/test_triton_wrapper 1/1 ... [2024-02-01 00:55:19.944379] 2024-02-01T00:55:19.9478056Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_triton_wrapper.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:55:19.944654] 2024-02-01T00:55:27.6729810Z 2024-02-01T00:55:27.6733047Z inductor/test_triton_wrapper 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_triton_wrapper_1.1_fb0c84dfb1ac3649_.log 2024-02-01T00:55:27.6734624Z Running 1 items in this shard: test/inductor/test_triton_wrapper.py::TestTritonWrapper::test_wrapper_using_gpu_seed 2024-02-01T00:55:27.6735302Z 2024-02-01T00:55:27.6735619Z Running export/test_export 1/1 ... [2024-02-01 00:55:27.672797] 2024-02-01T00:55:27.6737274Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_export.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:55:27.673016] 2024-02-01T00:55:45.2827118Z 2024-02-01T00:55:45.2828578Z export/test_export 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_export_1.1_09fddb13d4b6a652_.log 2024-02-01T00:55:45.2863577Z Running 94 items in this shard: test/export/test_export.py::TestDynamismExpression::test_export_assume_static_by_default, test/export/test_export.py::TestDynamismExpression::test_export_constraints_error, test/export/test_export.py::TestDynamismExpression::test_export_inline_constraints, test/export/test_export.py::TestExport::test__scaled_dot_product_flash_attention, test/export/test_export.py::TestExport::test_args_type_checked, test/export/test_export.py::TestExport::test_automatic_constrain_size, test/export/test_export.py::TestExport::test_basic, test/export/test_export.py::TestExport::test_basic_non_strict_fake_tensor, test/export/test_export.py::TestExport::test_basic_non_strict_real_tensor, test/export/test_export.py::TestExport::test_buffer_util, test/export/test_export.py::TestExport::test_check_specialized_int, test/export/test_export.py::TestExport::test_cond_buffers, test/export/test_export.py::TestExport::test_cond_with_module_stack_export_with, test/export/test_export.py::TestExport::test_cond_with_module_stack_export_with_unflatten, test/export/test_export.py::TestExport::test_constant_output, test/export/test_export.py::TestExport::test_constrain_as_size_error, test/export/test_export.py::TestExport::test_constrain_decomp, test/export/test_export.py::TestExport::test_constrain_size_in_eager, test/export/test_export.py::TestExport::test_constrain_size_with_constrain_value, test/export/test_export.py::TestExport::test_constrain_size_with_various_cases, test/export/test_export.py::TestExport::test_constrain_value_with_no_default, test/export/test_export.py::TestExport::test_constrain_value_with_symfloat, test/export/test_export.py::TestExport::test_constraint_directly_construct, test/export/test_export.py::TestExport::test_decomp_batch_norm_functional_predispatch, test/export/test_export.py::TestExport::test_dynamic_shapes_spec_with_pytree, test/export/test_export.py::TestExport::test_error_does_not_reference_eager_fallback, test/export/test_export.py::TestExport::test_export_api_with_dynamic_shapes, test/export/test_export.py::TestExport::test_export_cond_preserve_stack_trace_for_subgraphs, test/export/test_export.py::TestExport::test_export_decomps_dynamic, test/export/test_export.py::TestExport::test_export_decomps_simple, test/export/test_export.py::TestExport::test_export_dynamo_config, test/export/test_export.py::TestExport::test_export_func_with_default_kwargs, test/export/test_export.py::TestExport::test_export_func_with_keyword_only_args, test/export/test_export.py::TestExport::test_export_func_with_kwargs, test/export/test_export.py::TestExport::test_export_func_with_pytree_kwargs, test/export/test_export.py::TestExport::test_export_func_with_var_keyword_args, test/export/test_export.py::TestExport::test_export_func_with_var_keyword_pytree_args, test/export/test_export.py::TestExport::test_export_func_with_var_postional_args, test/export/test_export.py::TestExport::test_export_graph_with_no_inputs, test/export/test_export.py::TestExport::test_export_input_mutation_dynamic_shape, test/export/test_export.py::TestExport::test_export_input_mutation_static_shape, test/export/test_export.py::TestExport::test_export_mkldnn_disabled, test/export/test_export.py::TestExport::test_export_mod_constraints, test/export/test_export.py::TestExport::test_export_then_compile_tensor_ctor, test/export/test_export.py::TestExport::test_export_with_fake_tensor_inputs, test/export/test_export.py::TestExport::test_export_with_fake_tensor_inputs_on_cuda_devices, test/export/test_export.py::TestExport::test_export_with_inline_constraints, test/export/test_export.py::TestExport::test_export_with_inline_constraints_complex, test/export/test_export.py::TestExport::test_export_with_wrong_inputs, test/export/test_export.py::TestExport::test_external_call_non_strict_real_tensor, test/export/test_export.py::TestExport::test_fqn, test/export/test_export.py::TestExport::test_if_functional, test/export/test_export.py::TestExport::test_issue_113041, test/export/test_export.py::TestExport::test_lazy_module_kwargs, test/export/test_export.py::TestExport::test_lifted_constants, test/export/test_export.py::TestExport::test_linear_conv, test/export/test_export.py::TestExport::test_map, test/export/test_export.py::TestExport::test_map_buffers, test/export/test_export.py::TestExport::test_mixed_input, test/export/test_export.py::TestExport::test_module, test/export/test_export.py::TestExport::test_module_with_dict_container_inp_out, test/export/test_export.py::TestExport::test_multiple_definitions_same_name_dim, test/export/test_export.py::TestExport::test_nn_module_stack, test/export/test_export.py::TestExport::test_nn_module_stack_shared_submodule, test/export/test_export.py::TestExport::test_non_arg_name_dynamic_shapes_api, test/export/test_export.py::TestExport::test_non_arg_name_dynamic_shapes_api_with_container_type, test/export/test_export.py::TestExport::test_non_arg_name_dynamic_shapes_api_with_kwarg, test/export/test_export.py::TestExport::test_non_strict_dynamic_shapes, test/export/test_export.py::TestExport::test_non_strict_dynamic_shapes_suggested_fixes, test/export/test_export.py::TestExport::test_nonzero_2, test/export/test_export.py::TestExport::test_not_correct_dim, test/export/test_export.py::TestExport::test_pad_sequence, test/export/test_export.py::TestExport::test_param_util, test/export/test_export.py::TestExport::test_preserve_shape_dynamism_for_unused_inputs, test/export/test_export.py::TestExport::test_pytree_register_data_class, test/export/test_export.py::TestExport::test_pytree_register_nested_data_class, test/export/test_export.py::TestExport::test_raise_user_error_when_guard_on_data_dependent_operation, test/export/test_export.py::TestExport::test_redundant_asserts, test/export/test_export.py::TestExport::test_retracable_ep, test/export/test_export.py::TestExport::test_retrace_graph_level_meta_preservation, test/export/test_export.py::TestExport::test_retrace_pre_autograd, test/export/test_export.py::TestExport::test_run_decomposition_supports_user_input_mutation, test/export/test_export.py::TestExport::test_runtime_assert_for_prim, test/export/test_export.py::TestExport::test_runtime_assert_for_prm_str, test/export/test_export.py::TestExport::test_sym_sqrt, test/export/test_export.py::TestExport::test_to_module_with_mutated_buffer, test/export/test_export.py::TestExport::test_to_module_with_mutated_buffer_multiple, test/export/test_export.py::TestExport::test_to_module_with_mutated_buffer_multiple_update_sub_later, test/export/test_export.py::TestExport::test_train_eval_on_exported_preautograd_module, test/export/test_export.py::TestOneOffModelExportResult::test_none_input_output, test/export/test_export.py::TestOneOffModelExportResult::test_primitive_constant_output, test/export/test_export.py::TestOneOffModelExportResult::test_scaled_dot_product_attention_cpu, test/export/test_export.py::TestOneOffModelExportResult::test_scaled_dot_product_attention_cuda, test/export/test_export.py::TestExportCustomClass::test_lift_custom_obj 2024-02-01T00:55:45.2896391Z 2024-02-01T00:55:45.2896753Z Running test_python_dispatch 1/1 ... [2024-02-01 00:55:45.282894] 2024-02-01T00:55:45.2898544Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_python_dispatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:55:45.283094] 2024-02-01T00:55:51.5587009Z 2024-02-01T00:55:51.5588228Z test_python_dispatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_python_dispatch_1.1_85676099dbac8bdb_.log 2024-02-01T00:55:51.5633419Z Running 108 items in this shard: test/test_python_dispatch.py::TestDispatcherPythonBindings::test_call_boxed, test/test_python_dispatch.py::TestPythonRegistration::test_alias_analysis, test/test_python_dispatch.py::TestPythonRegistration::test_create_new_library, test/test_python_dispatch.py::TestPythonRegistration::test_create_new_library_fragment_no_existing, test/test_python_dispatch.py::TestPythonRegistration::test_create_new_library_fragment_with_existing, test/test_python_dispatch.py::TestPythonRegistration::test_error_for_unsupported_ns_or_kind, test/test_python_dispatch.py::TestPythonRegistration::test_error_if_fn_not_callable, test/test_python_dispatch.py::TestPythonRegistration::test_extend_library_with_dispatch_key_arg, test/test_python_dispatch.py::TestPythonRegistration::test_finalizer, test/test_python_dispatch.py::TestPythonRegistration::test_override_aten_ops_with_multiple_libraries, test/test_python_dispatch.py::TestPythonRegistration::test_override_cpu_sum, test/test_python_dispatch.py::TestPythonRegistration::test_override_cuda_with_jiterator, test/test_python_dispatch.py::TestPythonRegistration::test_register_fallthrough, test/test_python_dispatch.py::TestPythonRegistration::test_register_functional_op_error_cases, test/test_python_dispatch.py::TestPythonRegistration::test_register_functional_op_multiple_returns, test/test_python_dispatch.py::TestPythonRegistration::test_register_functional_op_no_returns, test/test_python_dispatch.py::TestPythonRegistration::test_register_functional_op_one_return, test/test_python_dispatch.py::TestPythonRegistration::test_register_functional_op_with_optional, test/test_python_dispatch.py::TestPythonRegistration::test_returning_symint, test/test_python_dispatch.py::TestPythonDispatch::test_all_same_mode, test/test_python_dispatch.py::TestPythonDispatch::test_autograd_in_attr, test/test_python_dispatch.py::TestPythonDispatch::test_basic, test/test_python_dispatch.py::TestPythonDispatch::test_capture_logs_with_torch_dispatch_mode, test/test_python_dispatch.py::TestPythonDispatch::test_construct_int_tensor, test/test_python_dispatch.py::TestPythonDispatch::test_custom_autograd, test/test_python_dispatch.py::TestPythonDispatch::test_custom_size_policy_dynamic_shapes, test/test_python_dispatch.py::TestPythonDispatch::test_data_ptr_respects_numel_slow_path, test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_non_wrapper_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_wrapper_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_wrapper_subclass_with_clone_returning_different_type, test/test_python_dispatch.py::TestPythonDispatch::test_detach_appears_twice_when_called_once, test/test_python_dispatch.py::TestPythonDispatch::test_device_slowpath, test/test_python_dispatch.py::TestPythonDispatch::test_dim_slowpath, test/test_python_dispatch.py::TestPythonDispatch::test_dispatch_super_call, test/test_python_dispatch.py::TestPythonDispatch::test_dispatch_super_call_list_arg, test/test_python_dispatch.py::TestPythonDispatch::test_dispatch_super_dont_autograd, test/test_python_dispatch.py::TestPythonDispatch::test_error_using_class_method_on_mode, test/test_python_dispatch.py::TestPythonDispatch::test_exception_handling, test/test_python_dispatch.py::TestPythonDispatch::test_fancy_strides, test/test_python_dispatch.py::TestPythonDispatch::test_format, test/test_python_dispatch.py::TestPythonDispatch::test_get_cur_mode, test/test_python_dispatch.py::TestPythonDispatch::test_get_mode_stack, test/test_python_dispatch.py::TestPythonDispatch::test_index_put_where_only_index_is_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_invalid_ret, test/test_python_dispatch.py::TestPythonDispatch::test_is_contiguous_slow_path, test/test_python_dispatch.py::TestPythonDispatch::test_kwarg_only, test/test_python_dispatch.py::TestPythonDispatch::test_kwarg_only_and_positional_default, test/test_python_dispatch.py::TestPythonDispatch::test_layout_slow_path, test/test_python_dispatch.py::TestPythonDispatch::test_like, test/test_python_dispatch.py::TestPythonDispatch::test_list_ret, test/test_python_dispatch.py::TestPythonDispatch::test_make_fx_with_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_make_subclass_with_modes, test/test_python_dispatch.py::TestPythonDispatch::test_make_wrapper_subclass_noalloc, test/test_python_dispatch.py::TestPythonDispatch::test_make_wrapper_subclass_propagates_metadata, test/test_python_dispatch.py::TestPythonDispatch::test_maybe_tuple_bug, test/test_python_dispatch.py::TestPythonDispatch::test_mode_with_make_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_multiple_ops_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_nested_push_logging_tensor_mode, test/test_python_dispatch.py::TestPythonDispatch::test_nesting_same_mode, test/test_python_dispatch.py::TestPythonDispatch::test_new_ones, test/test_python_dispatch.py::TestPythonDispatch::test_none_wrapping, test/test_python_dispatch.py::TestPythonDispatch::test_notimplemented_mode, test/test_python_dispatch.py::TestPythonDispatch::test_optional_tensor_list, test/test_python_dispatch.py::TestPythonDispatch::test_out, test/test_python_dispatch.py::TestPythonDispatch::test_produce_real_type, test/test_python_dispatch.py::TestPythonDispatch::test_record_stream, test/test_python_dispatch.py::TestPythonDispatch::test_return_stream, test/test_python_dispatch.py::TestPythonDispatch::test_set_data, test/test_python_dispatch.py::TestPythonDispatch::test_shallow_copy_and_detach, test/test_python_dispatch.py::TestPythonDispatch::test_sizes_slow_path, test/test_python_dispatch.py::TestPythonDispatch::test_standard_is_not_subclass, test/test_python_dispatch.py::TestPythonDispatch::test_storage, test/test_python_dispatch.py::TestPythonDispatch::test_storage_can_be_converted_to_python_object, test/test_python_dispatch.py::TestPythonDispatch::test_strides_slow_path, test/test_python_dispatch.py::TestPythonDispatch::test_subclass_autograd_device_check, test/test_python_dispatch.py::TestPythonDispatch::test_subclass_creation, test/test_python_dispatch.py::TestPythonDispatch::test_subclass_priority, test/test_python_dispatch.py::TestPythonDispatch::test_sym_sizes_strides_slow_path, test/test_python_dispatch.py::TestPythonDispatch::test_tolist_numpy_with_torch_dispatch_mode, test/test_python_dispatch.py::TestPythonDispatch::test_torch_dispatch_mode_basic, test/test_python_dispatch.py::TestPythonDispatch::test_torch_dispatch_mode_respects_no_dispatch, test/test_python_dispatch.py::TestPythonDispatch::test_torch_dispatch_mode_subclass_priority, test/test_python_dispatch.py::TestPythonDispatch::test_torch_dispatch_mode_unrelated_tensors, test/test_python_dispatch.py::TestPythonDispatch::test_version, test/test_python_dispatch.py::TestPythonDispatch::test_with_mode_created_separately, test/test_python_dispatch.py::TestPythonDispatch::test_with_nested_modes, test/test_python_dispatch.py::TestPythonDispatch::test_wrapper_subclass_extra_dispatch_keys, test/test_python_dispatch.py::TestPythonDispatch::test_wrapper_subclass_serializes, test/test_python_dispatch.py::TestPythonDispatcher::test_basic, test/test_python_dispatch.py::TestPythonDispatcher::test_lstsq, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_cat_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_conv2d_cuda, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyCatCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyCubeCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyMulCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyNMSCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyNonzeroCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpySortCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpySplitCopyCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpySplitCopyWithIntCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyTakeCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_custom_NumpyViewCopyCustomOp_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_mul_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_native_batch_norm_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_out_op_cuda, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_split_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_split_list_args_cuda_float32, test/test_python_dispatch.py::TestWrapperSubclassAliasingCUDA::test_wrapper_subclass_aliasing_view_cuda_float32 2024-02-01T00:55:51.5676750Z 2024-02-01T00:55:51.5677139Z Running inductor/test_layout_optim 1/1 ... [2024-02-01 00:55:51.558871] 2024-02-01T00:55:51.5678941Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_layout_optim.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:55:51.559100] 2024-02-01T00:56:02.7935314Z 2024-02-01T00:56:02.7937028Z inductor/test_layout_optim 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_layout_optim_1.1_131ec340484235f0_.log 2024-02-01T00:56:02.7942215Z Running 10 items in this shard: test/inductor/test_layout_optim.py::TestLayoutOptim::test_2conv_with_graph_break, test/inductor/test_layout_optim.py::TestLayoutOptim::test_3conv_with_graph_break, test/inductor/test_layout_optim.py::TestLayoutOptim::test_dynamic_shape_specialization, test/inductor/test_layout_optim.py::TestLayoutOptim::test_keep_output_layout_infer, test/inductor/test_layout_optim.py::TestLayoutOptim::test_keep_output_layout_with_freezing, test/inductor/test_layout_optim.py::TestLayoutOptim::test_mutate_base, test/inductor/test_layout_optim.py::TestLayoutOptim::test_mutate_base_for_conv_output, test/inductor/test_layout_optim.py::TestLayoutOptim::test_mutate_view, test/inductor/test_layout_optim.py::TestLayoutOptim::test_mutate_view_for_conv_output, test/inductor/test_layout_optim.py::TestLayoutOptim::test_training_acc 2024-02-01T00:56:02.7946169Z 2024-02-01T00:56:02.7946511Z Running dynamo/test_unspec 1/1 ... [2024-02-01 00:56:02.793623] 2024-02-01T00:56:02.7948177Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_unspec.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:56:02.793836] 2024-02-01T00:56:13.3275670Z 2024-02-01T00:56:13.3277126Z dynamo/test_unspec 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_unspec_1.1_56933a6e635a7bbb_.log 2024-02-01T00:56:13.3289544Z Running 26 items in this shard: test/dynamo/test_unspec.py::UnspecTests::test_argmin_coerces_symint_to_intlist_spec, test/dynamo/test_unspec.py::UnspecTests::test_builtin_functions_on_cuda, test/dynamo/test_unspec.py::UnspecTests::test_builtin_getitem, test/dynamo/test_unspec.py::UnspecTests::test_builtin_max_min, test/dynamo/test_unspec.py::UnspecTests::test_compiled_random_calls_are_random, test/dynamo/test_unspec.py::UnspecTests::test_conv1d_symint_padding, test/dynamo/test_unspec.py::UnspecTests::test_data_dependent_evaluate_expr_graph_break, test/dynamo/test_unspec.py::UnspecTests::test_exponential, test/dynamo/test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only, test/dynamo/test_unspec.py::UnspecTests::test_isinstance_symint, test/dynamo/test_unspec.py::UnspecTests::test_mark_01_dynamic, test/dynamo/test_unspec.py::UnspecTests::test_multiple_consecutive_random_calls_before_graph, test/dynamo/test_unspec.py::UnspecTests::test_no_recompilations, test/dynamo/test_unspec.py::UnspecTests::test_no_recompiles, test/dynamo/test_unspec.py::UnspecTests::test_numpy_correctness, test/dynamo/test_unspec.py::UnspecTests::test_propagate_dynamic_dim, test/dynamo/test_unspec.py::UnspecTests::test_random_call_with_while_loop, test/dynamo/test_unspec.py::UnspecTests::test_random_values_with_graph_break, test/dynamo/test_unspec.py::UnspecTests::test_rshift_dynamic, test/dynamo/test_unspec.py::UnspecTests::test_shape_graph_break, test/dynamo/test_unspec.py::UnspecTests::test_specializing_numpy_float_in_control_flow, test/dynamo/test_unspec.py::UnspecTests::test_sum_dimlist_spec, test/dynamo/test_unspec.py::UnspecTests::test_sym_int_conversion, test/dynamo/test_unspec.py::UnspecTests::test_symfloat_to_tensor, test/dynamo/test_unspec.py::UnspecTests::test_unspec_float_precision, test/dynamo/test_unspec.py::UnspecTests::test_use_and_specialize 2024-02-01T00:56:13.3303150Z 2024-02-01T00:56:13.3303674Z Running test_tensor_creation_ops 1/1 ... [2024-02-01 00:56:13.327760] 2024-02-01T00:56:13.3306363Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_tensor_creation_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:56:13.327980] 2024-02-01T00:57:48.1989315Z 2024-02-01T00:57:48.1991013Z test_tensor_creation_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_tensor_creation_ops_1.1_c5ba8e04cbce3b99_.log 2024-02-01T00:57:48.2204689Z Running 510 items in this shard: test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_device_vs_cpu_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_device_vs_cpu_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_device_vs_cpu_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_device_vs_cpu_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_inference_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_lowp_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_arange_lowp_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_as_strided_neg_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_as_tensor_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_block_diag_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_block_diag_scipy_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cartesian_prod_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat2_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat2_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat2_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_all_dtypes_and_devices_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_big_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_empty_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_empty_legacy_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_in_channels_last_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_mem_overlap_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_channels_last_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_uint16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_uint32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_uint64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_fast_path_dim0_dim1_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_out_memory_format_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_preserve_channels_last_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_cat_stack_cross_devices_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_combinations_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_complex_type_conversions_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_constructor_device_legacy_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_constructor_dtypes_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_ctor_with_numpy_array_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_device_rounding_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_device_rounding_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_device_rounding_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_diag_embed_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_diagflat_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dsplit_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dsplit_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dsplit_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_dstack_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_empty_full_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_empty_overflow_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_empty_strided_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_empty_tensor_props_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_eye_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_fill_all_dtypes_and_devices_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_finite_cuda_bool, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_finite_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_finite_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_finite_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_finite_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_finite_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_nonfinite_cuda_bool, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_nonfinite_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_nonfinite_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_nonfinite_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_nonfinite_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_float_to_int_conversion_nonfinite_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_from_file_shared_False_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_from_file_shared_True_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_full_inference_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_full_inference_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_full_inference_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_full_out_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hsplit_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hsplit_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hsplit_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_hstack_column_stack_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_window_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_window_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_window_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_window_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_kaiser_window_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_large_linspace_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_large_linspace_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_like_fn_stride_proparation_vs_tensoriterator_unary_op_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linlogspace_mem_overlap_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_deduction_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_device_vs_cpu_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_device_vs_cpu_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_device_vs_cpu_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_device_vs_cpu_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_device_vs_cpu_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_device_vs_cpu_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_special_steps_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_special_steps_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_special_steps_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_special_steps_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_special_steps_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_special_steps_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_complex_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_integral_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_integral_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_integral_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_integral_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_linspace_vs_numpy_integral_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_base2_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_base2_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_base2_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_deduction_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_device_vs_cpu_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_device_vs_cpu_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_device_vs_cpu_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_special_steps_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_special_steps_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_special_steps_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_vs_numpy_complex_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_vs_numpy_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_logspace_vs_numpy_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_default_indexing_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_empty_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_ij_indexing_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_ij_indexing_is_default_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_inconsistent_device_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_inconsistent_dtype_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_non_1d_tensor_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_unsupported_indexing_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_vs_numpy_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_warns_if_no_indexing_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_meshgrid_xy_indexing_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_new_empty_strided_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_new_methods_requires_grad_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_new_tensor_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_offset_scalar_cast_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_ones_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_bool_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_default_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_bool_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_uint16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_uint32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_from_to_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_uint16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_uint32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_full_range_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_uint16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_uint32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_random_to_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_range_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_range_factories_64bit_indexing_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_range_warning_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_repeat_interleave_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_roll_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_bartlett_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_bartlett_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_bartlett_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_bartlett_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_bartlett_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_blackman_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_blackman_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_blackman_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_blackman_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_blackman_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hamming_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hamming_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hamming_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hamming_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hamming_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hann_cuda_bfloat16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hann_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hann_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hann_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_window_functions_window_hann_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_bartlett_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_bartlett_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_blackman_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_blackman_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_cosine_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_cosine_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_hamming_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_hamming_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_hann_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_hann_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_nuttall_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_signal_windows_functions_window_nuttall_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_simple_scalar_cast_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_stack_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_stack_out_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_storage_filename_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_strided_mismatched_stride_shape_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_ctor_device_inference_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_device_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_factories_empty_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_factory_copy_var_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_factory_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_factory_gpu_type_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_factory_gpu_type_inference_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_factory_type_inference_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_from_non_writable_numpy_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_tensor_from_sequence_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_bool, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_floating_dtype_error_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_out_dtype_error_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_out_dtype_error_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_same_dtype_error_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_complex_same_dtype_error_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_polar_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_torch_polar_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_unpack_double_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_unpack_double_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_bool, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vander_types_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vsplit_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vsplit_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vsplit_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_complex128, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_float64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_int32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_int8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_vstack_row_stack_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_cuda, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_bool, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_complex64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_float16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_float32, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_int16, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_int64, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_dtype_layout_device_match_cuda_uint8, test/test_tensor_creation_ops.py::TestTensorCreationCUDA::test_zeros_out_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_normal_cuda_float32, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_normal_cuda_float64, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_normal_std_error_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_rand_cuda_complex128, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_rand_cuda_complex32, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_rand_cuda_complex64, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_rand_cuda_float32, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_rand_cuda_float64, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randint_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randint_inference_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_bfloat16, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_complex128, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_complex32, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_complex64, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_float16, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_float32, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randn_cuda_float64, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_random_neg_values_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randperm_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_randperm_device_compatibility_cuda, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_uniform_from_to_cuda_bfloat16, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_uniform_from_to_cuda_float16, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_uniform_from_to_cuda_float32, test/test_tensor_creation_ops.py::TestRandomTensorCreationCUDA::test_uniform_from_to_cuda_float64, test/test_tensor_creation_ops.py::TestLikeTensorCreationCUDA::test_empty_like_cuda, test/test_tensor_creation_ops.py::TestLikeTensorCreationCUDA::test_full_like_inference_cuda, test/test_tensor_creation_ops.py::TestLikeTensorCreationCUDA::test_ones_like_cuda, test/test_tensor_creation_ops.py::TestLikeTensorCreationCUDA::test_ones_like_multiple_device_cuda, test/test_tensor_creation_ops.py::TestLikeTensorCreationCUDA::test_zeros_like_cuda, test/test_tensor_creation_ops.py::TestLikeTensorCreationCUDA::test_zeros_like_multiple_device_cuda, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_uint16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_uint32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_uint64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_buffer_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_dlpack_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_uint16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_uint32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_uint64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_numpy_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_alias_from_tensor_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_astensor_consistency_cuda, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_uint16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_uint32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_uint64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_buffer_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_dlpack_mult_devices_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_uint16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_uint32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_uint64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_numpy_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_from_tensor_mult_devices_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_list_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_bfloat16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_bool, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_complex128, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_float16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_float64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_int16, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_int32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_int64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_int8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_copy_tensor_cuda_uint8, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_default_device_cuda, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_device_without_index_cuda, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_numpy_scalars_cuda, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_retain_autograd_history_cuda_complex64, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_retain_autograd_history_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_unsupported_alias_cuda_float32, test/test_tensor_creation_ops.py::TestAsArrayCUDA::test_unsupported_alias_mult_devices_cuda_float32 2024-02-01T00:57:48.2413418Z 2024-02-01T00:57:48.2413854Z Running export/test_passes 1/1 ... [2024-02-01 00:57:48.199711] 2024-02-01T00:57:48.2415555Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_passes.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:57:48.199962] 2024-02-01T00:57:51.7707839Z 2024-02-01T00:57:51.7711544Z export/test_passes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_passes_1.1_c42e545a756f66ae_.log 2024-02-01T00:57:51.7717174Z Running 12 items in this shard: test/export/test_passes.py::TestPasses::test_functionalization_with_view_copy, test/export/test_passes.py::TestPasses::test_functionalize_inline_constraints, test/export/test_passes.py::TestPasses::test_math_ops, test/export/test_passes.py::TestPasses::test_runtime_assert_inline_constraints_for_cond, test/export/test_passes.py::TestPasses::test_runtime_assert_inline_constraints_for_item, test/export/test_passes.py::TestPasses::test_runtime_assert_inline_constraints_for_nonzero, test/export/test_passes.py::TestPasses::test_runtime_assert_multiple_dims, test/export/test_passes.py::TestPasses::test_runtime_assert_one_dim, test/export/test_passes.py::TestPasses::test_runtime_assert_some_dims_not_specified, test/export/test_passes.py::TestPasses::test_runtime_assert_some_inps_not_used, test/export/test_passes.py::TestPasses::test_view_to_view_copy, test/export/test_passes.py::TestPasses::test_views_op_having_view_copy 2024-02-01T00:57:51.7721646Z 2024-02-01T00:57:51.7722004Z Running dynamo/test_subclasses 1/1 ... [2024-02-01 00:57:51.770722] 2024-02-01T00:57:51.7723710Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_subclasses.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:57:51.770936] 2024-02-01T00:58:04.0076101Z 2024-02-01T00:58:04.0077420Z dynamo/test_subclasses 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_subclasses_1.1_dad467b748106687_.log 2024-02-01T00:58:04.0095671Z Running 40 items in this shard: test/dynamo/test_subclasses.py::SubclassTests::test_base_torch_function_tracing, test/dynamo/test_subclasses.py::SubclassTests::test_compile_higher_order_with_functionalization, test/dynamo/test_subclasses.py::SubclassTests::test_compile_with_fake_tensor_automatic_dynamic, test/dynamo/test_subclasses.py::SubclassTests::test_compile_with_fake_tensor_dynamic_dim, test/dynamo/test_subclasses.py::SubclassTests::test_compile_with_functionalization, test/dynamo/test_subclasses.py::SubclassTests::test_has_torch_function, test/dynamo/test_subclasses.py::SubclassTests::test_no_call_to_new, test/dynamo/test_subclasses.py::SubclassTests::test_overridden_method_guarding, test/dynamo/test_subclasses.py::SubclassTests::test_recompile_with_symbool_inputs, test/dynamo/test_subclasses.py::SubclassTests::test_return_as_subclass, test/dynamo/test_subclasses.py::SubclassTests::test_return_local_subclass, test/dynamo/test_subclasses.py::SubclassTests::test_return_subclass, test/dynamo/test_subclasses.py::SubclassTests::test_support_bases, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_call_on_attr, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_call_on_method, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_state_graph_break, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_state_guards, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_state_nested, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_state_tracing, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_wrapper_class, test/dynamo/test_subclasses.py::SubclassTests::test_torch_function_wrapper_class_with_kwargs, test/dynamo/test_subclasses.py::SubclassTests::test_type_check_equality_subclass, test/dynamo/test_subclasses.py::SubclassTests::test_type_check_equality_tensor, test/dynamo/test_subclasses.py::SubclassTests::test_type_check_identity_subclass, test/dynamo/test_subclasses.py::SubclassTests::test_type_check_identity_tensor, test/dynamo/test_subclasses.py::SubclassTests::test_type_check_isinstance_subclass, test/dynamo/test_subclasses.py::SubclassTests::test_type_check_isinstance_tensor, test/dynamo/test_subclasses.py::SubclassTests::test_user_overidden_attr_unsupported, test/dynamo/test_subclasses.py::SubclassTests::test_user_overidden_method_unsupported, test/dynamo/test_subclasses.py::SubclassTests::test_user_overidden_property_unsupported, test/dynamo/test_subclasses.py::SubclassTests::test_wrapper_subclass_guards_on_inner_tensor, test/dynamo/test_subclasses.py::SubclassTests::test_wrapper_subclass_with_differently_sized_inner_tensor, test/dynamo/test_subclasses.py::SubclassTests::test_wrapper_subclass_with_same_sized_inner_tensor, test/dynamo/test_subclasses.py::TestNestedTensor::test_basic_autograd, test/dynamo/test_subclasses.py::TestNestedTensor::test_basic_autograd_inductor, test/dynamo/test_subclasses.py::TestNestedTensor::test_binary_does_not_recompile, test/dynamo/test_subclasses.py::TestNestedTensor::test_binary_recompiles, test/dynamo/test_subclasses.py::TestNestedTensor::test_inputs_to_compiled_fn_are_views, test/dynamo/test_subclasses.py::TestNestedTensor::test_unary_does_not_recompile, test/dynamo/test_subclasses.py::TestNestedTensor::test_unbind 2024-02-01T00:58:04.0112458Z 2024-02-01T00:58:04.0112822Z Running test_cpp_api_parity 1/1 ... [2024-02-01 00:58:04.007742] 2024-02-01T00:58:04.0114655Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_api_parity.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:58:04.007958] 2024-02-01T00:58:19.0498921Z 2024-02-01T00:58:19.0500515Z test_cpp_api_parity 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_api_parity_1.1_769febd81468634d_.log 2024-02-01T00:58:19.0712218Z Running 488 items in this shard: test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCELoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCELoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCELoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCELoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCELoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCELoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCEWithLogitsLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCEWithLogitsLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCEWithLogitsLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCEWithLogitsLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCEWithLogitsLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_BCEWithLogitsLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_circular_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_circular_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_groups, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_groups_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad1, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad1_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad1size1, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad1size1_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad2size1, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad2size1_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_same, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_same2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_same2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_same_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_same_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_same_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_valid, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_pad_valid_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_reflect_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_reflect_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_replicate_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_replicate_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_stride, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_stride_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_zero_batch, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_zero_batch_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_zeros_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv1d_zeros_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_circular_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_circular_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_padded, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_padded_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_strided, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_strided_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_with_multiplier, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_depthwise_with_multiplier_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_groups, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_groups_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_groups_thnn, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_groups_thnn_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_no_bias, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_no_bias_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_pad_same, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_pad_same_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_pad_same_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_pad_same_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_pad_valid, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_pad_valid_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_padding, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_padding_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_reflect_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_reflect_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_replicate_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_replicate_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_strided, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_strided_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_zero_batch, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_zero_batch_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_zeros_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv2d_zeros_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_1x1x1_no_bias, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_1x1x1_no_bias_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_circular_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_circular_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_dilated_strided, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_dilated_strided_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_groups, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_groups_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_no_bias, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_no_bias_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_pad_same, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_pad_same_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_pad_same_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_pad_same_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_pad_valid, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_pad_valid_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_replicate_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_replicate_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_stride, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_stride_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_stride_padding, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_stride_padding_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_zero_batch, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_zero_batch_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_zeros_stride2_pad2, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Conv3d_zeros_stride2_pad2_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_groups, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_groups_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_no_bias, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose1d_no_bias_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_groups, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_groups_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_no_bias, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose2d_no_bias_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose3d_dilated, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ConvTranspose3d_dilated_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CosineEmbeddingLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CosineEmbeddingLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CosineEmbeddingLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CosineEmbeddingLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CosineEmbeddingLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CosineEmbeddingLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CrossMapLRN2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_CrossMapLRN2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Embedding, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_discontiguous, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_discontiguous_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_max, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_max_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_max_padding_idx, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_max_padding_idx_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_mean_padding_idx, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_mean_padding_idx_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_sparse, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_sparse_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_sum_padding_idx, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_EmbeddingBag_sum_padding_idx_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Embedding_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Embedding_discontiguous, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Embedding_discontiguous_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Embedding_sparse, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Embedding_sparse_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Flatten, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Flatten_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Flatten_no_batch_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Flatten_no_batch_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_int_input, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_int_input_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_no_batch_dim_input, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_no_batch_dim_input_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_no_batch_dim_int_input, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Fold_no_batch_dim_int_input_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_HingeEmbeddingLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_HingeEmbeddingLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_HingeEmbeddingLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_HingeEmbeddingLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_HingeEmbeddingLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_HingeEmbeddingLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_LayerNorm_3d_no_affine_large_feature, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_LayerNorm_3d_no_affine_large_feature_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Linear, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Linear_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Linear_no_batch_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Linear_no_batch_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Linear_no_bias, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Linear_no_bias_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MarginRankingLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MarginRankingLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MarginRankingLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MarginRankingLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MarginRankingLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MarginRankingLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelMarginLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelMarginLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelMarginLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelMarginLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelMarginLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelMarginLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelSoftMarginLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelSoftMarginLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelSoftMarginLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelSoftMarginLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelSoftMarginLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_MultiLabelSoftMarginLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_NLLLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_NLLLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_NLLLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_NLLLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_NLLLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_NLLLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_broadcast_lhs, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_broadcast_lhs_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_broadcast_rhs, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_broadcast_rhs_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_no_batch_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_no_batch_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_with_non_default_args, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PairwiseDistance_with_non_default_args_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PixelShuffle, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PixelShuffle_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PixelUnshuffle, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_PixelUnshuffle_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_RReLU, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_RReLU_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_RReLU_with_up_down, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_RReLU_with_up_down_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_RReLU_with_up_down_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_RReLU_with_up_down_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ReplicationPad3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ReplicationPad3d_complex, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ReplicationPad3d_complex_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ReplicationPad3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ReplicationPad3d_no_batch_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_ReplicationPad3d_no_batch_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SampleModule_has_parity, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SampleModule_has_parity_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SampleModule_no_parity, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SampleModule_no_parity_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SoftMarginLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SoftMarginLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SoftMarginLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SoftMarginLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SoftMarginLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_SoftMarginLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerDecoderLayer_gelu_activation, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerDecoderLayer_gelu_activation_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerDecoderLayer_relu_activation, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerDecoderLayer_relu_activation_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerEncoderLayer_gelu_activation, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerEncoderLayer_gelu_activation_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerEncoderLayer_relu_activation, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TransformerEncoderLayer_relu_activation_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Transformer_multilayer_coder, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Transformer_multilayer_coder_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TripletMarginLoss_no_batch_dim_mean, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TripletMarginLoss_no_batch_dim_mean_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TripletMarginLoss_no_batch_dim_none, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TripletMarginLoss_no_batch_dim_none_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TripletMarginLoss_no_batch_dim_sum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_TripletMarginLoss_no_batch_dim_sum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Unflatten_no_batch_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Unflatten_no_batch_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Unfold, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Unfold_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Unfold_int_input, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_Unfold_int_input_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_weights_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_weights_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_weights_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCELoss_weights_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCEWithLogitsLoss_legacy_enum, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCEWithLogitsLoss_legacy_enum_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCEWithLogitsLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCEWithLogitsLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCEWithLogitsLoss_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_BCEWithLogitsLoss_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_HingeEmbeddingLoss_margin_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_HingeEmbeddingLoss_margin_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_HingeEmbeddingLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_HingeEmbeddingLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_HuberLoss_delta, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_HuberLoss_delta_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_log_target, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_log_target_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_scalar_log_target, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_no_reduce_scalar_log_target_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_with_log_target_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_with_log_target_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_with_target_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_KLDivLoss_with_target_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_L1Loss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_L1Loss_no_reduce_complex, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_L1Loss_no_reduce_complex_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_L1Loss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_L1Loss_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_L1Loss_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MSELoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MSELoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MSELoss_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MSELoss_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_0d_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_0d_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_1d_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_1d_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_index_neg, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_index_neg_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelMarginLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelSoftMarginLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelSoftMarginLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelSoftMarginLoss_weights_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiLabelSoftMarginLoss_weights_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_1d_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_1d_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_margin_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_margin_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_p_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_p_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_weights_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_MultiMarginLoss_weights_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss2d_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss2d_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss2d_no_reduce_ignore_index, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss2d_no_reduce_ignore_index_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss2d_no_reduce_weights, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss2d_no_reduce_weights_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLossNd_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLossNd_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLossNd_no_reduce_ignore_index, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLossNd_no_reduce_ignore_index_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLossNd_no_reduce_weights, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLossNd_no_reduce_weights_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_ignore_index, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_ignore_index_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_weights, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_weights_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_weights_ignore_index, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_weights_ignore_index_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_weights_ignore_index_neg, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_NLLLoss_no_reduce_weights_ignore_index_neg_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_PoissonNLLLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_PoissonNLLLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_beta, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_beta_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_no_reduce_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_no_reduce_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_zero_beta, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SmoothL1Loss_zero_beta_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SoftMarginLoss_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_SoftMarginLoss_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_2d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_2d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_tuple_shared_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_tuple_shared_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_tuple_skewed_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_tuple_skewed_2d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_tuple_skewed_2d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_scale_tuple_skewed_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_tuple_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_tuple_2d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_tuple_2d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bicubic_tuple_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_2d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_2d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_tuple_shared_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_tuple_shared_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_tuple_skewed_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_tuple_skewed_2d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_tuple_skewed_2d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_scale_tuple_skewed_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_tuple_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_tuple_2d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_tuple_2d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_bilinear_tuple_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_1d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_1d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_1d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_1d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_scale_1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_scale_1d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_scale_1d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_scale_1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_tuple_1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_linear_tuple_1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_1d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_1d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_2d_launch_configs, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_2d_launch_configs_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_2d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_2d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_3d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_3d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_scale_1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_scale_1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_scale_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_scale_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_scale_3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_scale_3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_tuple_1d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_tuple_1d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_tuple_2d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_tuple_2d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_tuple_3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_nearest_tuple_3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_3d_zero_dim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_3d_zero_dim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_scale_3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_scale_3d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_scale_3d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_scale_3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_tuple_3d, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_tuple_3d_align_corners, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_tuple_3d_align_corners_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_interpolate_trilinear_tuple_3d_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_dim0, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_dim0_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_dim3, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_dim3_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_lastdim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_lastdim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_spatial, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_spatial_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_spatial_special, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_log_softmax_spatial_special_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_multimarginloss_1d_input_0d_target_no_reduce, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_multimarginloss_1d_input_0d_target_no_reduce_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_sample_functional_has_parity, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_sample_functional_has_parity_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_sample_functional_no_parity, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_sample_functional_no_parity_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_functional_dim0, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_functional_dim0_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_functional_dim3, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_functional_dim3_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_functional_scalar, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_functional_scalar_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_lastdim, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_lastdim_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_lastdim_dtype, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_lastdim_dtype_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_spatial, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_spatial_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_spatial_dtype, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_spatial_dtype_cuda, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_spatial_special, test/test_cpp_api_parity.py::TestCppApiParity::test_torch_nn_functional_softmax_spatial_special_cuda 2024-02-01T00:58:19.0917894Z 2024-02-01T00:58:19.0918260Z Running dynamo/test_backends 1/1 ... [2024-02-01 00:58:19.050773] 2024-02-01T00:58:19.0920014Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_backends.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:58:19.050990] 2024-02-01T00:58:33.5917418Z 2024-02-01T00:58:33.5918830Z dynamo/test_backends 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_backends_1.1_ce002c65cfb8a3fb_.log 2024-02-01T00:58:33.5929363Z Running 17 items in this shard: test/dynamo/test_backends.py::TestOptimizations::test_aot_cudagraphs, test/dynamo/test_backends.py::TestOptimizations::test_aot_eager, test/dynamo/test_backends.py::TestOptimizations::test_aot_eager_decomp_partition, test/dynamo/test_backends.py::TestOptimizations::test_aot_ts, test/dynamo/test_backends.py::TestOptimizations::test_eager, test/dynamo/test_backends.py::TestOptimizations::test_example_inputs, test/dynamo/test_backends.py::TestOptimizations::test_example_inputs_runtime_use, test/dynamo/test_backends.py::TestOptimizations::test_list_backends, test/dynamo/test_backends.py::TestOptimizations::test_onnxrt, test/dynamo/test_backends.py::TestOptimizations::test_torchscript, test/dynamo/test_backends.py::TestOptimizations::test_tvm, test/dynamo/test_backends.py::NormalizeIRTests::test_inplace_normalize, test/dynamo/test_backends.py::MPSNotSupportedTest::test_mps_not_supported, test/dynamo/test_backends.py::TestExplainWithBackend::test_explain_with_backend, test/dynamo/test_backends.py::TestCustomBackendAPI::test_aot_autograd_api, test/dynamo/test_backends.py::TestCustomBackendAPI::test_lookup_backend, test/dynamo/test_backends.py::TestCustomBackendAPI::test_register_backend_api 2024-02-01T00:58:33.5935562Z 2024-02-01T00:58:33.5935942Z Running dynamo/test_after_aot 1/1 ... [2024-02-01 00:58:33.592216] 2024-02-01T00:58:33.5937655Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_after_aot.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:58:33.592444] 2024-02-01T00:58:41.6216411Z 2024-02-01T00:58:41.6217986Z dynamo/test_after_aot 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_after_aot_1.1_7ace6479d266a5b1_.log 2024-02-01T00:58:41.6219952Z Running 2 items in this shard: test/dynamo/test_after_aot.py::TestAfterAot::test_dump_tensor, test/dynamo/test_after_aot.py::TestAfterAot::test_save_graph_repro 2024-02-01T00:58:41.6221027Z 2024-02-01T00:58:41.6221510Z Running inductor/test_config 1/1 ... [2024-02-01 00:58:41.621601] 2024-02-01T00:58:41.6223231Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_config.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:58:41.621831] 2024-02-01T00:58:51.5029763Z 2024-02-01T00:58:51.5031081Z inductor/test_config 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_config_1.1_a6b528fa27962db9_.log 2024-02-01T00:58:51.5037330Z Running 11 items in this shard: test/inductor/test_config.py::TestInductorConfig::test_api_options, test/inductor/test_config.py::TestInductorConfig::test_compile_api, test/inductor/test_config.py::TestInductorConfig::test_compile_api_passes_config, test/inductor/test_config.py::TestInductorConfig::test_get_compiler_config, test/inductor/test_config.py::TestInductorConfig::test_hasattr, test/inductor/test_config.py::TestInductorConfig::test_invalid_backend, test/inductor/test_config.py::TestInductorConfig::test_invalid_names, test/inductor/test_config.py::TestInductorConfig::test_non_inductor_backend, test/inductor/test_config.py::TestInductorConfig::test_patch, test/inductor/test_config.py::TestInductorConfig::test_save_load, test/inductor/test_config.py::TestInductorConfig::test_set 2024-02-01T00:58:51.5042846Z 2024-02-01T00:58:51.5043468Z Running inductor/test_coordinate_descent_tuner 1/1 ... [2024-02-01 00:58:51.503045] 2024-02-01T00:58:51.5046640Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_coordinate_descent_tuner.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:58:51.503260] 2024-02-01T00:58:56.4275571Z 2024-02-01T00:58:56.4278885Z inductor/test_coordinate_descent_tuner 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_coordinate_descent_tuner_1.1_fb67dfe4f0b2bf76_.log 2024-02-01T00:58:56.4282672Z Running 4 items in this shard: test/inductor/test_coordinate_descent_tuner.py::TestCoordinateDescentTuner::test_abs_function, test/inductor/test_coordinate_descent_tuner.py::TestCoordinateDescentTuner::test_get_neighbour_values, test/inductor/test_coordinate_descent_tuner.py::TestCoordinateDescentTuner::test_no_neighbors, test/inductor/test_coordinate_descent_tuner.py::TestCoordinateDescentTuner::test_persistent_reduction 2024-02-01T00:58:56.4285954Z 2024-02-01T00:58:56.4286403Z Running test_schema_check 1/2 ... [2024-02-01 00:58:56.427730] 2024-02-01T00:58:56.4288967Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_schema_check.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 00:58:56.427979] 2024-02-01T01:08:32.5237869Z 2024-02-01T01:08:32.5239081Z test_schema_check 1/2 was successful, full logs can be found in artifacts with path test/test-reports/test_schema_check_1.2_81ca1644eb743c20_.log 2024-02-01T01:08:32.6912555Z Running 2959 items in this shard: test/test_schema_check.py::TestSchemaCheck::test_alias_check_fail_custom_ops_output_is_input, test/test_schema_check.py::TestSchemaCheck::test_alias_check_fail_custom_ops_secretly_aliasing, test/test_schema_check.py::TestSchemaCheck::test_alias_check_fail_custom_ops_secretly_mutating, test/test_schema_check.py::TestSchemaCheck::test_alias_check_fail_multiple_operators, test/test_schema_check.py::TestSchemaCheck::test_alias_check_fail_outputs_unexpectedly_aliasing, test/test_schema_check.py::TestSchemaCheck::test_alias_check_fail_simple, test/test_schema_check.py::TestSchemaCheck::test_is_alias_of_basic, test/test_schema_check.py::TestSchemaCheck::test_mutation_check_fail_multiple_operators, test/test_schema_check.py::TestSchemaCheck::test_overlaps_empty_container, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_aliasing_inputs, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_default_replaced, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_device_input, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_list_input, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_training_op, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_wildcard_after, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_functionality_with_multiple_outputs, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_mutated_aliasing_aliasing_outputs, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_mutated_aliasing_as_strided, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_mutated_aliasing_none, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_mutated_aliasing_resize_, test/test_schema_check.py::TestSchemaCheck::test_schema_check_mode_operator_order, test/test_schema_check.py::TestSchemaCheck::test_schema_info_bind_basic, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_H_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_H_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_H_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_H_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_H_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_T_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_T_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_T_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_T_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_T_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___getitem___cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___getitem___cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___getitem___cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___getitem___cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___getitem___cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___getitem___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___radd___cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___radd___cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___radd___cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___radd___cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___radd___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rand___cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rand___cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rand___cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rand___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rdiv___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmatmul___cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmatmul___cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmod___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmul___cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmul___cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmul___cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmul___cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rmul___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___ror___cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rpow___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rsub___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rxor___cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rxor___cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rxor___cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rxor___cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness___rxor___cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__native_batch_norm_legit_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__segment_reduce_lengths_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__segment_reduce_lengths_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__segment_reduce_offsets_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__segment_reduce_offsets_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__upsample_bilinear2d_aa_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness__upsample_bilinear2d_aa_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_abs_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acos_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acosh_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acosh_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acosh_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acosh_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_acosh_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_add_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addbmm_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addbmm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addcdiv_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addcdiv_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addcmul_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addcmul_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addcmul_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addcmul_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_decomposed_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_decomposed_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_decomposed_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_decomposed_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmm_decomposed_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmv_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmv_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmv_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addmv_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_addr_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_all_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_all_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_all_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_all_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_all_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_all_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_allclose_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_allclose_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_allclose_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amax_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amax_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amax_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amax_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amax_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amax_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amin_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amin_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amin_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_amin_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_aminmax_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_aminmax_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_aminmax_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_aminmax_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_angle_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_angle_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_any_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_arange_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_arange_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_arange_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmax_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmax_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmax_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmax_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmin_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmin_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argmin_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argsort_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argsort_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argsort_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argsort_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argwhere_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argwhere_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argwhere_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argwhere_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argwhere_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_argwhere_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_partial_views_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_as_strided_scatter_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asin_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asin_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asin_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asinh_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asinh_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asinh_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asinh_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_asinh_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan2_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan2_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan2_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan2_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atan_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atanh_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_1d_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_2d_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_3d_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_3d_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_3d_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_atleast_3d_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_baddbmm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bernoulli_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bernoulli_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bfloat16_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bfloat16_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bfloat16_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bfloat16_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bfloat16_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bincount_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bincount_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bincount_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bincount_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_and_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_and_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_and_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_and_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_left_shift_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_left_shift_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_left_shift_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_not_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_not_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_or_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_or_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_or_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_or_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_right_shift_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_xor_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_xor_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bitwise_xor_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_block_diag_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bmm_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bmm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bmm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bool_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_tensors_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_tensors_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_tensors_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_tensors_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_tensors_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_broadcast_to_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bucketize_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bucketize_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_bucketize_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_byte_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cartesian_prod_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cartesian_prod_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cartesian_prod_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cartesian_prod_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cartesian_prod_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cartesian_prod_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cat_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cat_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cat_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cat_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cat_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cat_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cauchy_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cauchy_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cauchy_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdist_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdist_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cdouble_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ceil_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ceil_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cfloat_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cfloat_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cfloat_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cfloat_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cfloat_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cfloat_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chalf_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chalf_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chalf_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chalf_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chalf_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_char_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cholesky_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cholesky_inverse_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cholesky_inverse_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cholesky_inverse_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cholesky_solve_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cholesky_solve_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_chunk_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_max_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_max_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_max_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_min_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_min_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_min_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_min_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clamp_min_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clone_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clone_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clone_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clone_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_clone_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_column_stack_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_combinations_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_complex_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_complex_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_physical_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_physical_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_physical_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_physical_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_physical_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_conj_physical_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_constant_pad_nd_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_contiguous_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_contiguous_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_contiguous_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_contiguous_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_contiguous_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_copysign_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_copysign_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_copysign_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_copysign_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_copysign_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_corrcoef_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_corrcoef_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_corrcoef_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_corrcoef_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_corrcoef_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_corrcoef_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cos_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cos_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cos_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cos_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cos_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cosh_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cosh_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cosh_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cosh_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cosh_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cosh_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_count_nonzero_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cov_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cov_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cov_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cov_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cross_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cross_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cross_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cross_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummax_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummax_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummin_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummin_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummin_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cummin_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumprod_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumprod_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumprod_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumsum_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumulative_trapezoid_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumulative_trapezoid_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumulative_trapezoid_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumulative_trapezoid_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_cumulative_trapezoid_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_deg2rad_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_deg2rad_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_deg2rad_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diag_embed_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagflat_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_copy_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diagonal_scatter_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_diff_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_digamma_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_digamma_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_digamma_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_digamma_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_digamma_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_digamma_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dist_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dist_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dist_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_floor_rounding_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_floor_rounding_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_floor_rounding_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_floor_rounding_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_floor_rounding_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_no_rounding_mode_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_no_rounding_mode_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_no_rounding_mode_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_no_rounding_mode_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_no_rounding_mode_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_no_rounding_mode_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_trunc_rounding_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_trunc_rounding_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_trunc_rounding_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_trunc_rounding_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_div_trunc_rounding_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dot_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dot_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dot_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_double_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_double_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_double_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_double_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_double_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dsplit_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dsplit_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dsplit_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dsplit_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_dstack_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_einsum_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_einsum_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_einsum_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_like_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_permuted_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_strided_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_strided_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_strided_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_empty_strided_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eq_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_equal_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_equal_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_equal_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_equal_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erf_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erf_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erf_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erf_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfc_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfc_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfc_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfc_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_erfinv_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp2_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exp_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_as_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expand_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expm1_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expm1_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expm1_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expm1_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expm1_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_expm1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exponential_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_exponential_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_eye_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft2_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fft_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftn_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_fftshift_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft2_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft2_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft2_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft2_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfft_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfftn_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfftn_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfftn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_hfftn_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft2_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft2_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft2_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifft_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftn_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftn_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftn_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftn_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftn_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftshift_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftshift_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftshift_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ifftshift_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft2_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfft_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfftn_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfftn_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfftn_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfftn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_ihfftn_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft2_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft2_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft2_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft2_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft2_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfft_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfftn_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfftn_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfftn_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfftn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfftn_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_irfftn_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft2_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfft_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfftn_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfftn_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfftn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfftn_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft_rfftn_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fill_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fill_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fill_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fill_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fill_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flatten_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flip_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flip_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flip_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flip_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flip_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flip_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fliplr_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_flipud_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_float_power_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_divide_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_divide_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_divide_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_floor_divide_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmax_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmax_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmax_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmin_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmin_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmin_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmod_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmod_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmod_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmod_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fmod_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_frac_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_frac_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_frac_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_frexp_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_frexp_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_like_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_like_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_like_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_like_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_like_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_full_like_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gather_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gather_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gather_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gather_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gather_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gcd_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gcd_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ge_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ge_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ge_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ge_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ge_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_geometric_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_geometric_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_geometric_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_geqrf_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_geqrf_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gradient_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gradient_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gradient_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gradient_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gradient_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gradient_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_grid_sampler_2d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_grid_sampler_2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_grid_sampler_2d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gt_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gt_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gt_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gt_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gt_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_gt_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_half_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_heaviside_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_heaviside_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_heaviside_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_heaviside_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_heaviside_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_histc_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_histc_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_histc_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hsplit_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hsplit_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hsplit_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hsplit_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hsplit_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hstack_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hypot_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_hypot_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_i0_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_i0_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_igamma_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_imag_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_imag_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_add_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_copy_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_copy_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_copy_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_copy_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_copy_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_fill_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_fill_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_fill_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_fill_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_fill_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_put_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_put_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_put_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_put_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_put_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_put_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_reduce_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_reduce_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_reduce_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_reduce_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_reduce_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_select_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_select_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_select_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_index_select_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_inner_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_int_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isclose_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isclose_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isclose_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isclose_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isfinite_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isfinite_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isfinite_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isfinite_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isfinite_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isin_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isin_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isin_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isin_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isin_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isinf_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isnan_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isnan_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isnan_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isnan_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isnan_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isnan_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isneginf_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isneginf_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isneginf_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isneginf_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isneginf_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isneginf_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isposinf_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isposinf_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isposinf_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isposinf_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_isreal_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_istft_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_item_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_item_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_item_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_item_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_item_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_item_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_2inputs_2outputs_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_2inputs_2outputs_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_2inputs_2outputs_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_2inputs_2outputs_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_4inputs_with_extra_args_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_return_by_ref_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_return_by_ref_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_return_by_ref_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_binary_return_by_ref_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_jiterator_unary_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kron_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kron_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kron_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kron_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kron_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kron_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kthvalue_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kthvalue_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_kthvalue_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lcm_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lcm_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lcm_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ldexp_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_le_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_le_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_le_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_le_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_le_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lerp_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lgamma_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lgamma_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lgamma_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lgamma_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lgamma_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lgamma_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cholesky_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cholesky_ex_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cross_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cross_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cross_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cross_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_cross_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_det_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_det_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_det_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_det_singular_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_det_singular_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_det_singular_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_diagonal_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_diagonal_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_diagonal_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_diagonal_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_diagonal_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eig_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eig_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigh_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigh_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigh_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigvals_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigvals_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigvalsh_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigvalsh_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_eigvalsh_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_householder_product_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_householder_product_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_householder_product_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_inv_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_inv_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_inv_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_inv_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_inv_ex_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_ldl_factor_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_ldl_factor_ex_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_ldl_factor_ex_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_ldl_factor_ex_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_ldl_solve_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lstsq_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lstsq_grad_oriented_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lstsq_grad_oriented_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lstsq_grad_oriented_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_factor_ex_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_factor_ex_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_factor_ex_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_lu_solve_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_norm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_power_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_power_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_power_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_rank_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_rank_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_rank_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_rank_hermitian_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_matrix_rank_hermitian_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_multi_dot_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_multi_dot_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_subgradients_at_zero_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_subgradients_at_zero_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_norm_subgradients_at_zero_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_pinv_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_pinv_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_pinv_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_pinv_hermitian_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_pinv_hermitian_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_qr_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_qr_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_slogdet_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_slogdet_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_solve_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_solve_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_solve_triangular_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_solve_triangular_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_svd_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_svd_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_svdvals_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_svdvals_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_svdvals_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_tensorinv_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_tensorsolve_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vander_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vander_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vander_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vecdot_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vecdot_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vector_norm_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vector_norm_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linalg_vector_norm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_tensor_overload_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_tensor_overload_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_tensor_overload_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_tensor_overload_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_tensor_overload_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_linspace_tensor_overload_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log10_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log1p_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log2_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_normal_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_normal_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_log_softmax_with_dtype_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logaddexp2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logaddexp_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logaddexp_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logcumsumexp_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logcumsumexp_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logdet_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logdet_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logdet_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logdet_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_and_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_and_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_and_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_and_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_not_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_or_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logical_xor_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logit_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logit_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logit_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logit_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logit_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_tensor_overload_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_tensor_overload_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_tensor_overload_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_tensor_overload_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_tensor_overload_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logspace_tensor_overload_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logsumexp_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logsumexp_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logsumexp_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logsumexp_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logsumexp_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_logsumexp_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_long_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_long_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_long_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_long_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lt_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lt_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lt_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lt_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lt_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lt_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_solve_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_solve_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_solve_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_unpack_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_lu_unpack_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mH_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mH_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mH_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mH_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mH_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mT_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amax_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amin_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_amin_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmax_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmax_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmax_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmax_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmax_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmin_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmin_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_argmin_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumprod_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumsum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumsum_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumsum_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumsum_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_cumsum_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_fill_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_log_softmax_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_log_softmax_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_logaddexp_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_logaddexp_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_logsumexp_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_logsumexp_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_logsumexp_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_mean_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_mean_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_mean_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_mean_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_mean_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_mean_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_median_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_median_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_norm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_norm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_normalize_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_normalize_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_normalize_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_prod_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_prod_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_prod_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_prod_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_scatter_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_scatter_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_scatter_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_scatter_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_scatter_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_select_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_select_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_select_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_select_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_select_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_softmin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_softmin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_softmin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_std_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_std_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_std_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_std_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_std_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_std_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_sum_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_sum_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_sum_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_sum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_var_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_var_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_var_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_var_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_masked_var_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_matmul_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_matmul_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_matmul_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_matrix_exp_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_matrix_exp_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_binary_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_binary_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_binary_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_binary_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_binary_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_binary_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_pool2d_with_indices_backward_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_pool2d_with_indices_backward_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_pool2d_with_indices_backward_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_no_dim_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_with_dim_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_with_dim_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_with_dim_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_with_dim_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_max_reduction_with_dim_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_maximum_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_maximum_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_maximum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_maximum_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mean_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mean_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mean_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mean_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_median_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_median_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_median_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_median_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_list_of_tensors_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_variadic_tensors_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_variadic_tensors_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_variadic_tensors_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_variadic_tensors_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_variadic_tensors_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_meshgrid_variadic_tensors_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_binary_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_no_dim_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_no_dim_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_no_dim_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_no_dim_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_no_dim_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_no_dim_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_with_dim_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_with_dim_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_with_dim_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_with_dim_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_with_dim_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_min_reduction_with_dim_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_minimum_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_minimum_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_minimum_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_minimum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_minimum_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_minimum_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mm_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mm_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mode_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mode_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mode_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mode_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mode_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mode_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_movedim_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_msort_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_msort_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_msort_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_msort_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_msort_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mul_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mul_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mul_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mul_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mul_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mv_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mv_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mv_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_1_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_1_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_3_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_3_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_3_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_3_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_3_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_5_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_5_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_5_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_5_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_5_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_mvlgamma_mvlgamma_p_5_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nan_to_num_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmean_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmean_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmean_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmean_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmedian_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmedian_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmedian_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanmedian_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nanquantile_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nansum_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nansum_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nansum_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nansum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nansum_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nansum_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_copy_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_narrow_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_native_dropout_backward_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_native_dropout_backward_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_native_dropout_backward_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_native_layer_norm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ne_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ne_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ne_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ne_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ne_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ne_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_neg_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_neg_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_neg_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_neg_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_neg_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_empty_strided_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_full_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_full_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_full_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_full_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_full_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_ones_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_ones_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_ones_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_ones_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_new_zeros_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nextafter_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool1d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool2d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool2d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool2d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool3d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_avg_pool3d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_max_pool1d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_max_pool1d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_max_pool1d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_max_pool2d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_max_pool3d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_adaptive_max_pool3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_alpha_dropout_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_alpha_dropout_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool1d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool1d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool2d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool2d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool3d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool3d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_avg_pool3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_batch_norm_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_batch_norm_without_cudnn_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_batch_norm_without_cudnn_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_bilinear_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_bilinear_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_bilinear_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_binary_cross_entropy_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_binary_cross_entropy_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_binary_cross_entropy_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_binary_cross_entropy_with_logits_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_celu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv1d_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv1d_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv1d_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv2d_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv2d_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv2d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv3d_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv3d_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv3d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose1d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose1d_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose1d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose2d_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose2d_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_conv_transpose3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_cosine_embedding_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_cosine_embedding_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_cosine_embedding_loss_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_cosine_embedding_loss_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_cosine_similarity_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_cosine_similarity_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_ctc_loss_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_ctc_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout2d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout3d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_dropout_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_elu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_elu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_elu_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_embedding_bag_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_embedding_bag_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_embedding_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_embedding_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_embedding_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_embedding_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_with_train_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_with_train_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_without_train_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_without_train_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_without_train_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_without_train_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_without_train_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_feature_alpha_dropout_without_train_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_fractional_max_pool3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_fractional_max_pool3d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_fractional_max_pool3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_gaussian_nll_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_gaussian_nll_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_gelu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_gelu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_glu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_glu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_glu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_grid_sample_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_grid_sample_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_grid_sample_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_group_norm_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardshrink_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardshrink_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardsigmoid_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardsigmoid_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardswish_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardswish_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardtanh_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardtanh_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardtanh_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardtanh_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hardtanh_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hinge_embedding_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_hinge_embedding_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_huber_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_huber_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_instance_norm_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_area_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_area_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_bicubic_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_bicubic_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_bicubic_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_bilinear_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_bilinear_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_bilinear_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_linear_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest-exact_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest-exact_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest-exact_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_nearest_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_interpolate_trilinear_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_kl_div_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_kl_div_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_kl_div_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_l1_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_l1_loss_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_l1_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_l1_loss_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_layer_norm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_layer_norm_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_layer_norm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_leaky_relu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_leaky_relu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_linear_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_linear_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_local_response_norm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_local_response_norm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_logsigmoid_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_margin_ranking_loss_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool1d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool1d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool2d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool2d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool3d_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool3d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_pool3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool1d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool1d_grad_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool1d_grad_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool2d_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool2d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool2d_grad_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool2d_grad_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool2d_grad_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool3d_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool3d_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool3d_grad_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_max_unpool3d_grad_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_mish_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_mse_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_mse_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_mse_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multi_head_attention_forward_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multi_head_attention_forward_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multi_margin_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multi_margin_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multi_margin_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multilabel_margin_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multilabel_soft_margin_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_multilabel_soft_margin_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_nll_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_nll_loss_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_nll_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_normalize_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_normalize_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_normalize_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_one_hot_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_circular_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_constant_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_reflect_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pad_replicate_negative_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pairwise_distance_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pairwise_distance_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pairwise_distance_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pairwise_distance_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pairwise_distance_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pairwise_distance_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pdist_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_shuffle_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_shuffle_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_shuffle_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_shuffle_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_shuffle_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_shuffle_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_unshuffle_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_unshuffle_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_unshuffle_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_pixel_unshuffle_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_poisson_nll_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_poisson_nll_loss_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_poisson_nll_loss_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_poisson_nll_loss_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_poisson_nll_loss_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_poisson_nll_loss_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_prelu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_prelu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_prelu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu6_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu6_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu6_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu6_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu6_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_relu_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_rrelu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_rrelu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_rrelu_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_scaled_dot_product_attention_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_scaled_dot_product_attention_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_selu_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_selu_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_silu_complex_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_silu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_silu_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_silu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_smooth_l1_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_soft_margin_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_soft_margin_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softmin_with_dtype_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softplus_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softplus_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softplus_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_softsign_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_tanhshrink_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_tanhshrink_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_tanhshrink_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_threshold_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_threshold_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_threshold_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_threshold_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_threshold_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_loss_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_loss_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_loss_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_with_distance_loss_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_with_distance_loss_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_with_distance_loss_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_triplet_margin_with_distance_loss_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_unfold_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_unfold_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_unfold_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_unfold_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_unfold_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_upsample_bilinear_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_upsample_nearest_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_upsample_nearest_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nn_functional_upsample_nearest_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_static_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_static_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_static_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_static_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_static_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_nonzero_static_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_fro_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_fro_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_fro_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_inf_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_inf_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_inf_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_inf_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_norm_nuc_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_normal_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_normal_in_place_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_normal_in_place_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_normal_in_place_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_normal_number_mean_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_normal_number_mean_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ones_like_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ormqr_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ormqr_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_outer_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pca_lowrank_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_permute_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polar_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_0_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_0_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_0_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_0_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_0_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_0_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_1_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_1_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_1_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_1_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_2_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_2_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_2_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_2_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_3_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_3_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_3_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_3_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_3_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_4_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_4_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_4_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_4_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_4_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_polygamma_polygamma_n_4_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_positive_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_positive_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_positive_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_positive_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_positive_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_pow_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_prod_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_put_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_put_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_put_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_put_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_put_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_put_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_qr_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_qr_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_qr_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_quantile_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rad2deg_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rad2deg_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rad2deg_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rad2deg_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_like_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_like_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_like_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_like_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randint_like_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randn_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randn_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randn_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_randn_like_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ravel_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ravel_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ravel_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ravel_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ravel_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_ravel_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_real_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reciprocal_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reciprocal_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reciprocal_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reciprocal_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reciprocal_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_remainder_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_remainder_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_remainder_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_renorm_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_renorm_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_renorm_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_repeat_interleave_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_as_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_reshape_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize__cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize_as__cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize_as__cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize_as__cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize_as__cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resize_as__cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_conj_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_resolve_neg_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_roll_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rot90_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rot90_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rot90_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rot90_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rot90_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rot90_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_0_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_0_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_3_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_3_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_3_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_neg_3_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_neg_3_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_neg_3_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_round_decimals_neg_3_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsqrt_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsub_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsub_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsub_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_rsub_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scalar_tensor_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scalar_tensor_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scalar_tensor_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scalar_tensor_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scalar_tensor_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scalar_tensor_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_add_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_add_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_add_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amax_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amax_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amax_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amax_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amax_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amin_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amin_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_amin_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_mean_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_mean_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_mean_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_mean_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_prod_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_prod_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_prod_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_prod_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_sum_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_sum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_sum_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_scatter_reduce_sum_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_searchsorted_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_searchsorted_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_searchsorted_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_searchsorted_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_select_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_select_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_select_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_select_scatter_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_select_scatter_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_select_scatter_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sgn_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_short_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_short_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_short_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_short_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_short_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sigmoid_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sign_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sign_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sign_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sign_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sign_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sign_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_bartlett_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_bartlett_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_blackman_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_cosine_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_exponential_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_gaussian_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_general_cosine_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_general_hamming_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_general_hamming_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_hann_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_kaiser_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_kaiser_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signal_windows_nuttall_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signbit_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signbit_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signbit_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signbit_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signbit_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_signbit_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sin_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinc_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinc_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinc_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinc_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sinh_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_slice_scatter_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_softmax_with_dtype_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sort_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sort_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sort_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sort_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sort_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sparse_mm_reduce_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sparse_sampled_addmm_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_airy_ai_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_airy_ai_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_airy_ai_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_airy_ai_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_airy_ai_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j0_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j0_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j0_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j1_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j1_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j1_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_j1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y0_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y0_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y0_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y0_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y0_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y0_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y1_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y1_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y1_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_bessel_y1_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_t_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_t_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_t_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_u_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_u_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_u_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_u_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_v_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_v_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_v_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_w_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_w_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_w_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_chebyshev_polynomial_w_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_entr_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_entr_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_entr_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_erfcx_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_erfcx_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_erfcx_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_hermite_polynomial_h_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_hermite_polynomial_he_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_hermite_polynomial_he_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i0e_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i0e_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i0e_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i0e_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i0e_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i0e_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1e_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1e_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1e_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1e_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_i1e_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_laguerre_polynomial_l_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_laguerre_polynomial_l_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_laguerre_polynomial_l_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_laguerre_polynomial_l_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_legendre_polynomial_p_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_log_ndtr_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_log_ndtr_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_log_ndtr_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_log_ndtr_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_log_ndtr_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i0_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i0_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i0_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i0_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i0_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i1_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i1_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i1_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i1_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_i1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k0_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k0_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k0_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k0_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k1_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k1_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k1_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_modified_bessel_k1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtr_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtr_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtr_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtr_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtri_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtri_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtri_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_ndtri_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_polygamma_special_polygamma_n_0_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_polygamma_special_polygamma_n_0_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_polygamma_special_polygamma_n_0_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_polygamma_special_polygamma_n_0_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_polygamma_special_polygamma_n_0_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_scaled_modified_bessel_k0_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_scaled_modified_bessel_k0_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_scaled_modified_bessel_k0_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_scaled_modified_bessel_k1_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_t_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_t_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_t_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_u_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_u_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_u_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_u_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_v_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_v_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_v_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_v_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_v_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_v_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_w_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_shifted_chebyshev_polynomial_w_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_spherical_bessel_j0_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_spherical_bessel_j0_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_spherical_bessel_j0_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_xlog1py_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_zeta_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_zeta_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_zeta_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_zeta_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_special_zeta_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_list_args_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_with_sizes_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_with_sizes_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_with_sizes_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_split_with_sizes_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sqrt_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sqrt_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sqrt_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sqrt_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sqrt_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sqrt_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_square_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_square_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_square_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_square_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_square_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_square_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_squeeze_multiple_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stack_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_unbiased_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_unbiased_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_unbiased_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_mean_unbiased_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_unbiased_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_std_unbiased_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_stft_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sub_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_to_size_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_to_size_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_to_size_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_to_size_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_to_size_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_sum_to_size_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_svd_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_svd_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_svd_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_svd_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_svd_lowrank_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_svd_lowrank_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_t_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_along_dim_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_take_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tan_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tan_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tan_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tan_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tan_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tan_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tanh_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tanh_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tanh_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tanh_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tanh_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensor_split_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensor_split_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensor_split_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensor_split_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensor_split_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensordot_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensordot_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensordot_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensordot_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tensordot_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tile_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tile_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tile_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tile_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tile_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_sparse_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_sparse_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_sparse_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_to_sparse_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_topk_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_topk_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_topk_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_topk_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_topk_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_topk_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trace_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trace_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trace_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trace_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trace_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trace_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_transpose_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_transpose_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_transpose_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_transpose_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_transpose_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapezoid_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapezoid_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapezoid_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trapz_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triangular_solve_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triangular_solve_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triangular_solve_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triangular_solve_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_tril_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_triu_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_true_divide_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trunc_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trunc_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trunc_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_trunc_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unbind_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unflatten_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_copy_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unfold_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_uniform_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_uniform_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_uniform_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_consecutive_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_consecutive_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_consecutive_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_consecutive_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_consecutive_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unique_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unravel_index_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unravel_index_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unravel_index_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unravel_index_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_chunk_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsafe_split_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsqueeze_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsqueeze_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsqueeze_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsqueeze_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsqueeze_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_unsqueeze_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_unbiased_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_unbiased_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_unbiased_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_unbiased_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_mean_unbiased_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_unbiased_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_unbiased_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_var_unbiased_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vdot_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vdot_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vdot_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vdot_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vdot_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_complex_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_real_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_as_real_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_copy_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_copy_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_copy_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_copy_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_view_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vsplit_cuda_uint8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vstack_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vstack_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vstack_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_vstack_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_where_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_where_cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_where_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_where_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_where_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_xlogy_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_xlogy_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_xlogy_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_xlogy_cuda_float64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zero__cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zero__cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zero__cuda_float16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_int16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_bfloat16, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_bool, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_complex128, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_complex32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_complex64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_float32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_int32, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_int64, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_int8, test/test_schema_check.py::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_zeros_like_cuda_uint8 2024-02-01T01:08:32.8370930Z 2024-02-01T01:08:32.8371303Z Running nn/test_convolution 3/3 ... [2024-02-01 01:08:32.528639] 2024-02-01T01:08:32.8372988Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_convolution.py', '--shard-id=3', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:08:32.528868] 2024-02-01T01:17:21.2030124Z 2024-02-01T01:17:21.2031692Z nn/test_convolution 3/3 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_convolution_3.3_97de5273c09bd022_.log 2024-02-01T01:17:21.2160284Z Running 205 items in this shard: test/nn/test_convolution.py::TestConvolutionNN::test_Conv1d_module_same_padding, test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_groups_nobias, test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_groups_nobias_v2, test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types_on_GPU_without_cudnn, test/nn/test_convolution.py::TestConvolutionNN::test_Conv3d_groups_nobias, test/nn/test_convolution.py::TestConvolutionNN::test_ConvTranspose2d_half_cublas_gemm, test/nn/test_convolution.py::TestConvolutionNN::test_ConvTranspose2d_output_size_downsample_upsample, test/nn/test_convolution.py::TestConvolutionNN::test_ConvTranspose3d_correct_output_size, test/nn/test_convolution.py::TestConvolutionNN::test_conv_padding_mode, test/nn/test_convolution.py::TestConvolutionNN::test_cudnn_non_contiguous, test/nn/test_convolution.py::TestConvolutionNN::test_functional_grad_conv, test/nn/test_convolution.py::TestConvolutionNN::test_grad_conv1d_weight, test/nn/test_convolution.py::TestConvolutionNN::test_grad_conv3d_weight, test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_backward_depthwise_cuda_complex128, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_depthwise_naive_groups_cuda_float64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_deterministic_cudnn_cuda_complex64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_deterministic_cudnn_cuda_float16, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_deterministic_cudnn_cuda_float64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_large_workspace_cuda_bfloat16, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_large_workspace_cuda_float16, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv2d_large_workspace_cuda_float64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_Conv3d_depthwise_naive_groups_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_ConvTranspose2d_size_1_kernel_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_ConvTranspose3d_size_1_kernel_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv1d_same_padding_backward_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv1d_same_padding_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv1d_valid_padding_backward_cuda_complex64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv1d_vs_scipy_mode_valid_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv2d_same_padding_backward_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv2d_same_padding_cuda_complex64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv2d_valid_padding_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv2d_vs_scipy_mode_same_cuda_complex64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_same_padding_backward_cuda_float64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_valid_padding_backward_cuda_complex128, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_valid_padding_backward_cuda_float64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_valid_padding_cuda_complex64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_vs_scipy_mode_valid_cuda_complex64, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_vs_scipy_mode_valid_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cuda_depthwise1d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cuda_depthwise1d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cuda_depthwise1d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cuda_depthwise2d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cuda_depthwise3d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn1d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn1d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn1d_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn1d_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_transposed_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn2d_transposed_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn3d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn3d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn3d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn3d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_cudnn3d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch1d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch1d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch2d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch2d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch2d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch2d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch3d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch3d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch3d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch3d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel1d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel1d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel1d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel1d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel2d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel3d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel3d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_batch_channel3d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel1d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel1d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel1d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel2d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel2d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel2d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel2d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel3d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel3d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel3d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel3d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_empty_channel3d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen1d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen1d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen1d_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen1d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen1d_transposed_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_transposed_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen2d_transposed_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_transposed_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen3d_transposed_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise1d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise1d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise1d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise1d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise1d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise2d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise2d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_miopen_depthwise3d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_cpu_input_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_cpu_input_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_cpu_input_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_cpu_input_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_cpu_input_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_transposed_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn1d_transposed_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_cpu_input_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_cpu_input_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_transposed_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn2d_transposed_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_cpu_input_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_cpu_input_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn3d_transposed_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch1d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch1d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch2d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch2d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch3d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch3d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch_channel1d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch_channel1d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch_channel1d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch_channel2d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch_channel2d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_batch_channel3d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel1d_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel1d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel2d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel2d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel2d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel3d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_mkldnn_empty_channel3d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_dilated_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_dilated_has_bias_True_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_dilated_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_dilated_transposed_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_transposed_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow1d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_dilated_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_dilated_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_dilated_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_dilated_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_dilated_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_transposed_has_bias_False_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_transposed_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_transposed_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow2d_transposed_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_cpu_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_cpu_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_cpu_has_bias_True_strided_False_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_cpu_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_cuda_has_bias_True_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_cuda_has_bias_True_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_dilated_has_bias_False_strided_False_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_dilated_has_bias_False_strided_True_contiguous_False_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_backend_slow3d_dilated_has_bias_False_strided_True_contiguous_True_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_mismatch_memory_format_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_empty_channel_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_large_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_large_nosplit_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_noncontig_weights_and_bias_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_thnn_nhwc_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_convert_conv2d_weight_memory_format_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float16, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float32, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_group_convTranspose_empty_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_noncontig_conv_grad_cuda_float64 2024-02-01T01:17:21.2285663Z 2024-02-01T01:17:21.2286126Z Running test_model_exports_to_core_aten 1/1 ... [2024-02-01 01:17:21.203512] 2024-02-01T01:17:21.2287924Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_model_exports_to_core_aten.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:21.203726] 2024-02-01T01:17:23.8727411Z 2024-02-01T01:17:23.8729073Z test_model_exports_to_core_aten 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_model_exports_to_core_aten_1.1_c635f9cce0015727_.log 2024-02-01T01:17:23.8732481Z Running 1 items in this shard: test/test_model_exports_to_core_aten.py::TestQuantizePT2EModels::test_vit_aten_export 2024-02-01T01:17:23.8735511Z 2024-02-01T01:17:23.8736016Z Running dynamo/test_pre_dispatch 1/1 ... [2024-02-01 01:17:23.872614] 2024-02-01T01:17:23.8737951Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_pre_dispatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:23.872857] 2024-02-01T01:17:26.6423745Z 2024-02-01T01:17:26.6425747Z dynamo/test_pre_dispatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_pre_dispatch_1.1_e927df64dc23ec4f_.log 2024-02-01T01:17:26.6428316Z Running 3 items in this shard: test/dynamo/test_pre_dispatch.py::PreDispatchTests::test_autocast_simple, test/dynamo/test_pre_dispatch.py::PreDispatchTests::test_enable_grad_and_no_grad, test/dynamo/test_pre_dispatch.py::PreDispatchTests::test_no_grad_simple 2024-02-01T01:17:26.6429812Z 2024-02-01T01:17:26.6430170Z Running export/test_safeguard 1/1 ... [2024-02-01 01:17:26.642322] 2024-02-01T01:17:26.6431865Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_safeguard.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:26.642556] 2024-02-01T01:17:29.7123817Z 2024-02-01T01:17:29.7125845Z export/test_safeguard 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_safeguard_1.1_d9415d74b8a0f50e_.log 2024-02-01T01:17:29.7128378Z Running 3 items in this shard: test/export/test_safeguard.py::TestSafeguard::test_global_autograd, test/export/test_safeguard.py::TestSafeguard::test_global_autograd_exempt_predispatch, test/export/test_safeguard.py::TestSafeguard::test_tensor_autograd 2024-02-01T01:17:29.7129860Z 2024-02-01T01:17:29.7130319Z Running torch_np/numpy_tests/lib/test_type_check 1/1 ... [2024-02-01 01:17:29.712315] 2024-02-01T01:17:29.7132220Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'torch_np/numpy_tests/lib/test_type_check.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:29.712596] 2024-02-01T01:17:32.2314055Z 2024-02-01T01:17:32.2315872Z torch_np/numpy_tests/lib/test_type_check 1/1 was successful, full logs can be found in artifacts with path test/test-reports/torch_np.numpy_tests.lib.test_type_check_1.1_3bce48a22055ed20_.log 2024-02-01T01:17:32.2336408Z Running 50 items in this shard: test/torch_np/numpy_tests/lib/test_type_check.py::TestCommonType::test_basic, test/torch_np/numpy_tests/lib/test_type_check.py::TestMintypecode::test_default_1, test/torch_np/numpy_tests/lib/test_type_check.py::TestMintypecode::test_default_2, test/torch_np/numpy_tests/lib/test_type_check.py::TestMintypecode::test_default_3, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsscalar::test_basic, test/torch_np/numpy_tests/lib/test_type_check.py::TestReal::test_cmplx, test/torch_np/numpy_tests/lib/test_type_check.py::TestReal::test_real, test/torch_np/numpy_tests/lib/test_type_check.py::TestImag::test_cmplx, test/torch_np/numpy_tests/lib/test_type_check.py::TestImag::test_real, test/torch_np/numpy_tests/lib/test_type_check.py::TestIscomplex::test_fail, test/torch_np/numpy_tests/lib/test_type_check.py::TestIscomplex::test_pass, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsreal::test_fail, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsreal::test_isreal_real, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsreal::test_pass, test/torch_np/numpy_tests/lib/test_type_check.py::TestIscomplexobj::test_basic, test/torch_np/numpy_tests/lib/test_type_check.py::TestIscomplexobj::test_list, test/torch_np/numpy_tests/lib/test_type_check.py::TestIscomplexobj::test_scalar, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsrealobj::test_basic, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_complex, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_complex1, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_goodvalues, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_ind, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_integer, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_neginf, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsnan::test_posinf, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_complex, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_complex1, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_goodvalues, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_ind, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_integer, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_neginf, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsfinite::test_posinf, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsinf::test_goodvalues, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsinf::test_ind, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsinf::test_neginf, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsinf::test_neginf_scalar, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsinf::test_posinf, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsinf::test_posinf_scalar, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsposinf::test_generic, test/torch_np/numpy_tests/lib/test_type_check.py::TestIsneginf::test_generic, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_array, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_complex_bad, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_complex_bad2, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_complex_good, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_do_not_rewrite_previous_keyword, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_float, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_generic, test/torch_np/numpy_tests/lib/test_type_check.py::TestNanToNum::test_integer, test/torch_np/numpy_tests/lib/test_type_check.py::TestRealIfClose::test_basic, test/torch_np/numpy_tests/lib/test_type_check.py::TestArrayConversion::test_asfarray 2024-02-01T01:17:32.2354874Z 2024-02-01T01:17:32.2355260Z Running dynamo/test_comptime 1/1 ... [2024-02-01 01:17:32.231551] 2024-02-01T01:17:32.2356949Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_comptime.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:32.231778] 2024-02-01T01:17:35.0008245Z 2024-02-01T01:17:35.0009871Z dynamo/test_comptime 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_comptime_1.1_2b337daaf0c9bf75_.log 2024-02-01T01:17:35.0014034Z Running 8 items in this shard: test/dynamo/test_comptime.py::ComptimeTests::test_get_local, test/dynamo/test_comptime.py::ComptimeTests::test_graph_break, test/dynamo/test_comptime.py::ComptimeTests::test_print_bt, test/dynamo/test_comptime.py::ComptimeTests::test_print_disas, test/dynamo/test_comptime.py::ComptimeTests::test_print_graph, test/dynamo/test_comptime.py::ComptimeTests::test_print_guards, test/dynamo/test_comptime.py::ComptimeTests::test_print_locals, test/dynamo/test_comptime.py::ComptimeTests::test_print_value_stack 2024-02-01T01:17:35.0016701Z 2024-02-01T01:17:35.0017088Z Running nn/test_multihead_attention 1/1 ... [2024-02-01 01:17:35.000889] 2024-02-01T01:17:35.0018863Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_multihead_attention.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:35.001137] 2024-02-01T01:17:42.0273185Z 2024-02-01T01:17:42.0274771Z nn/test_multihead_attention 1/1 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_multihead_attention_1.1_126017fe4dfd4677_.log 2024-02-01T01:17:42.0287981Z Running 19 items in this shard: test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attention_average_attn_weights_False, test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attention_average_attn_weights_True, test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attn_3d_attn_mask, test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attn_fast_path_invalid_shape, test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attn_invalid_shape, test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attn_nested_tensor_outside_fast_path, test/nn/test_multihead_attention.py::TestMultiheadAttentionNN::test_multihead_attn_no_bias, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attention_dtype_batch_first_cuda_float16, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attention_dtype_batch_first_cuda_float32, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attention_dtype_batch_first_cuda_float64, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attention_dtype_cuda_float16, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attention_dtype_cuda_float32, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attention_dtype_cuda_float64, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attn_fast_path_query_and_bias_have_different_dtypes_cuda_float64, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attn_fast_path_small_test_cuda_float64, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attn_in_proj_bias_none_cuda_float64, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_attn_in_proj_weight_none_cuda_float64, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_self_attn_two_masks_fast_path_cuda, test/nn/test_multihead_attention.py::TestMultiheadAttentionNNDeviceTypeCUDA::test_multihead_self_attn_two_masks_fast_path_mock_cuda 2024-02-01T01:17:42.0299325Z 2024-02-01T01:17:42.0299697Z Running dynamo/test_cudagraphs 1/1 ... [2024-02-01 01:17:42.027564] 2024-02-01T01:17:42.0301404Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_cudagraphs.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:42.027765] 2024-02-01T01:17:46.4500868Z 2024-02-01T01:17:46.4502425Z dynamo/test_cudagraphs 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_cudagraphs_1.1_41f026e19d487eae_.log 2024-02-01T01:17:46.4506479Z Running 8 items in this shard: test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_basic, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_dead_fill, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_dtoh, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_factory, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_htod, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_mutate_constant, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_mutate_input, test/dynamo/test_cudagraphs.py::TestAotCudagraphs::test_mutated_metadata 2024-02-01T01:17:46.4509249Z 2024-02-01T01:17:46.4509740Z Starting test batch 'unranked_relevance' 10916.87148475647 seconds after initiating testing 2024-02-01T01:17:46.4510451Z With sharding, this batch will run 4 tests 2024-02-01T01:17:48.0056155Z Running inductor/test_cuda_cpp_wrapper 1/3 ... [2024-02-01 01:17:48.005100] 2024-02-01T01:17:48.0058421Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_cpp_wrapper.py', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:17:48.005373] 2024-02-01T01:25:26.2470608Z 2024-02-01T01:25:26.2473115Z inductor/test_cuda_cpp_wrapper 1/3 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_cpp_wrapper_1.3_7c656ded6810c4c5_.log 2024-02-01T01:25:26.2490338Z Running 25 items in this shard: test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_batch_norm_2d_2_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_bernoulli1_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_cat_slice_cat_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_foreach_cpp_wrapper_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_index_put_deterministic_fallback_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_profiler_mark_wrapper_call_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_sum_dtype_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_adding_tensor_offsets_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_addmm_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_batch_norm_2d_2_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_bernoulli1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_bitwise_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_bmm1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_bmm2_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_conv_backward_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_index_put_deterministic_fallback_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_linear1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_linear2_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_linear_relu_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_mm_views_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_relu_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_repeat_interleave_2_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_scalar_input_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_sort_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_sum_dtype_cuda_dynamic_shapes_cuda_wrapper 2024-02-01T01:25:26.2504551Z 2024-02-01T01:25:26.2504961Z Running inductor/test_aot_inductor_utils 1/1 ... [2024-02-01 01:25:26.247465] 2024-02-01T01:25:26.2506752Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:25:26.247692] 2024-02-01T01:25:28.0914835Z 2024-02-01T01:25:28.0916676Z inductor/test_aot_inductor_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_utils_1.1_8d8a5ef9f0a31ee2_.log 2024-02-01T01:25:28.0918116Z 2024-02-01T01:25:28.0918568Z Running lazy/test_bindings 1/1 ... [2024-02-01 01:25:28.091398] 2024-02-01T01:25:28.0921013Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'lazy/test_bindings.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:25:28.091616] 2024-02-01T01:25:29.3010099Z 2024-02-01T01:25:29.3011785Z lazy/test_bindings 1/1 was successful, full logs can be found in artifacts with path test/test-reports/lazy.test_bindings_1.1_322f384479fd708b_.log 2024-02-01T01:25:29.3012737Z 2024-02-01T01:25:29.3013506Z Running optim/test_optim 1/1 ... [2024-02-01 01:25:29.300992] 2024-02-01T01:25:29.3015776Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'optim/test_optim.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-02-01 01:25:29.301207] 2024-02-01T01:25:30.9542583Z 2024-02-01T01:25:30.9544062Z optim/test_optim 1/1 was successful, full logs can be found in artifacts with path test/test-reports/optim.test_optim_1.1_459db9b990c832c9_.log 2024-02-01T01:25:30.9545191Z 2024-02-01T01:25:30.9545697Z Starting test batch 'unlikely_relevance' 11381.37559723854 seconds after initiating testing 2024-02-01T01:25:30.9546530Z With sharding, this batch will run 0 tests 2024-02-01T01:25:30.9547353Z Starting test batch 'none_relevance' 11381.375631093979 seconds after initiating testing 2024-02-01T01:25:30.9548129Z With sharding, this batch will run 0 tests 2024-02-01T01:25:31.4438077Z Emiting td_test_failure_stats 2024-02-01T01:25:33.2869768Z inductor/test_torchinductor_dynamic_shapes 1/3 failed! 2024-02-01T01:25:33.6025316Z 2024-02-01T01:25:33.6025939Z real 189m46.691s 2024-02-01T01:25:33.6026315Z user 192m4.443s 2024-02-01T01:25:33.6026655Z sys 11m18.260s 2024-02-01T01:25:33.6065084Z ##[error]Process completed with exit code 1. 2024-02-01T01:25:33.6139712Z Prepare all required actions 2024-02-01T01:25:33.6140133Z Getting action download info 2024-02-01T01:25:33.9402108Z ##[group]Run ./.github/actions/pytest-cache-upload 2024-02-01T01:25:33.9402565Z with: 2024-02-01T01:25:33.9402854Z cache_dir: .pytest_cache 2024-02-01T01:25:33.9403210Z shard: 1 2024-02-01T01:25:33.9403554Z sha: 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-02-01T01:25:33.9404008Z test_config: default 2024-02-01T01:25:33.9404515Z job_identifier: slow_linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-02-01T01:25:33.9405495Z env: 2024-02-01T01:25:33.9405778Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:33.9406240Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:33.9406978Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:33.9407637Z ##[endgroup] 2024-02-01T01:25:33.9448540Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-02-01T01:25:33.9449102Z with: 2024-02-01T01:25:33.9449409Z shell: bash 2024-02-01T01:25:33.9449705Z timeout_minutes: 5 2024-02-01T01:25:33.9450031Z max_attempts: 5 2024-02-01T01:25:33.9450344Z retry_wait_seconds: 30 2024-02-01T01:25:33.9450797Z command: set -eu python3 -m pip install boto3==1.19.12 2024-02-01T01:25:33.9451313Z polling_interval_seconds: 1 2024-02-01T01:25:33.9451689Z warning_on_retry: true 2024-02-01T01:25:33.9452030Z continue_on_error: false 2024-02-01T01:25:33.9452366Z env: 2024-02-01T01:25:33.9452636Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:33.9453084Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:33.9453817Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:33.9454452Z ##[endgroup] 2024-02-01T01:25:34.2015029Z Defaulting to user installation because normal site-packages is not writeable 2024-02-01T01:25:35.2164262Z Collecting boto3==1.19.12 2024-02-01T01:25:35.2337787Z Downloading boto3-1.19.12-py3-none-any.whl (131 kB) 2024-02-01T01:25:35.2745731Z Collecting jmespath<1.0.0,>=0.7.1 2024-02-01T01:25:35.2785993Z Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB) 2024-02-01T01:25:36.4542142Z Collecting botocore<1.23.0,>=1.22.12 2024-02-01T01:25:36.4619096Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB) 2024-02-01T01:25:36.6118920Z Collecting s3transfer<0.6.0,>=0.5.0 2024-02-01T01:25:36.6157063Z Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB) 2024-02-01T01:25:36.6610806Z Collecting python-dateutil<3.0.0,>=2.1 2024-02-01T01:25:36.6648855Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB) 2024-02-01T01:25:36.6766986Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.18) 2024-02-01T01:25:36.7260934Z Collecting six>=1.5 2024-02-01T01:25:36.7296540Z Downloading six-1.16.0-py2.py3-none-any.whl (11 kB) 2024-02-01T01:25:36.7947158Z Installing collected packages: jmespath, six, python-dateutil, botocore, s3transfer, boto3 2024-02-01T01:25:37.4261332Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 2024-02-01T01:25:37.9967244Z Command completed after 1 attempt(s). 2024-02-01T01:25:38.0010684Z ##[group]Run python3 .github/scripts/pytest_cache.py \ 2024-02-01T01:25:38.0011287Z python3 .github/scripts/pytest_cache.py \ 2024-02-01T01:25:38.0011772Z  --upload \ 2024-02-01T01:25:38.0012185Z  --cache_dir $GITHUB_WORKSPACE/$CACHE_DIR \ 2024-02-01T01:25:38.0012698Z  --pr_identifier $GITHUB_REF \ 2024-02-01T01:25:38.0013190Z  --job_identifier $JOB_IDENTIFIER \ 2024-02-01T01:25:38.0013644Z  --sha $SHA \ 2024-02-01T01:25:38.0014025Z  --test_config $TEST_CONFIG \ 2024-02-01T01:25:38.0014448Z  --shard $SHARD \ 2024-02-01T01:25:38.0014828Z  --repo $REPO \ 2024-02-01T01:25:38.0015212Z  --temp_dir $RUNNER_TEMP \ 2024-02-01T01:25:38.0023213Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:38.0023714Z env: 2024-02-01T01:25:38.0023991Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.0024436Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.0025169Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.0025821Z CACHE_DIR: .pytest_cache 2024-02-01T01:25:38.0026331Z JOB_IDENTIFIER: slow_linux-focal-cuda12.1-py3-gcc9-slow-gradcheck 2024-02-01T01:25:38.0026942Z SHA: 86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528 2024-02-01T01:25:38.0027387Z TEST_CONFIG: default 2024-02-01T01:25:38.0027703Z SHARD: 1 2024-02-01T01:25:38.0027986Z REPO: pytorch/pytorch 2024-02-01T01:25:38.0028314Z ##[endgroup] 2024-02-01T01:25:38.3546778Z PR identifier for `refs/tags/ciflow/slow/118586` is `947d8e0bc6cf761760bfa72417503bf0` 2024-02-01T01:25:38.3550347Z Uploading cache with args Namespace(bucket=None, cache_dir='/home/ec2-user/actions-runner/_work/pytorch/pytorch/.pytest_cache', download=False, job_identifier='slow_linux-focal-cuda12.1-py3-gcc9-slow-gradcheck', pr_identifier='refs/tags/ciflow/slow/118586', repo='pytorch/pytorch', sha='86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528', shard='1', temp_dir='/home/ec2-user/actions-runner/_work/_temp', test_config='default', upload=True) 2024-02-01T01:25:38.3552995Z Zipping /home/ec2-user/actions-runner/_work/pytorch/pytorch/.pytest_cache 2024-02-01T01:25:38.3554745Z to /home/ec2-user/actions-runner/_work/_temp/zip-upload/pytest_cache/pytorch/pytorch/947d8e0bc6cf761760bfa72417503bf0/slow_linux-focal-cuda12_1-py3-gcc9-slow-gradcheck/86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528/default/1 2024-02-01T01:25:38.3557303Z Uploading /home/ec2-user/actions-runner/_work/_temp/zip-upload/pytest_cache/pytorch/pytorch/947d8e0bc6cf761760bfa72417503bf0/slow_linux-focal-cuda12_1-py3-gcc9-slow-gradcheck/86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528/default/1.zip 2024-02-01T01:25:38.3559914Z to s3://gha-artifacts/pytest_cache/pytorch/pytorch/947d8e0bc6cf761760bfa72417503bf0/slow_linux-focal-cuda12_1-py3-gcc9-slow-gradcheck/86d2ab39da7d7784e12e0e4a7ad4a9b5d1f7c528/default/1.zip 2024-02-01T01:25:38.3907538Z ##[group]Run cat test/**/*_toprint.log || true 2024-02-01T01:25:38.3908044Z cat test/**/*_toprint.log || true 2024-02-01T01:25:38.3915977Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:38.3916482Z env: 2024-02-01T01:25:38.3916760Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.3917212Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.3917945Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.3918588Z ##[endgroup] 2024-02-01T01:25:38.3949144Z cat: test/**/*_toprint.log: No such file or directory 2024-02-01T01:25:38.3977458Z ##[group]Run kill "$MONITOR_SCRIPT_PID" 2024-02-01T01:25:38.3977939Z kill "$MONITOR_SCRIPT_PID" 2024-02-01T01:25:38.3985594Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:38.3986089Z env: 2024-02-01T01:25:38.3986357Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.3986811Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.3987538Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.3988193Z MONITOR_SCRIPT_PID: 18746 2024-02-01T01:25:38.3988546Z ##[endgroup] 2024-02-01T01:25:38.4147067Z Prepare all required actions 2024-02-01T01:25:38.4147519Z Getting action download info 2024-02-01T01:25:38.5369044Z Download action repository 'actions/upload-artifact@v3' (SHA:a8a3f3ad30e3422c9c7b888a15615d19a852ae32) 2024-02-01T01:25:38.7002461Z ##[group]Run ./.github/actions/upload-test-artifacts 2024-02-01T01:25:38.7002969Z with: 2024-02-01T01:25:38.7003474Z file-suffix: test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405 2024-02-01T01:25:38.7004104Z env: 2024-02-01T01:25:38.7004385Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.7005039Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.7005785Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.7006415Z ##[endgroup] 2024-02-01T01:25:38.7035432Z ##[group]Run # Remove any previous test jsons if they exist 2024-02-01T01:25:38.7036057Z # Remove any previous test jsons if they exist 2024-02-01T01:25:38.7036614Z rm -f test-jsons-*.zip 2024-02-01T01:25:38.7037140Z zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' 2024-02-01T01:25:38.7045287Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:38.7045791Z env: 2024-02-01T01:25:38.7046069Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.7046514Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.7047246Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.7048093Z FILE_SUFFIX: test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405 2024-02-01T01:25:38.7048667Z ##[endgroup] 2024-02-01T01:25:38.7142245Z adding: test/allowlist_for_publicAPI.json (deflated 79%) 2024-02-01T01:25:38.7171360Z adding: test/benchmark_utils/callgrind_artifacts.json (deflated 92%) 2024-02-01T01:25:38.7172205Z adding: test/minioptest_failures_dict.json (deflated 70%) 2024-02-01T01:25:38.7178002Z adding: test/profiler/profiler_utils_mock_events.json (deflated 87%) 2024-02-01T01:25:38.7180310Z adding: test/.pytorch-slow-tests.json (deflated 77%) 2024-02-01T01:25:38.7186323Z adding: test/.pytorch-disabled-tests.json (deflated 85%) 2024-02-01T01:25:38.7210173Z ##[group]Run # Remove any previous test reports if they exist 2024-02-01T01:25:38.7210822Z # Remove any previous test reports if they exist 2024-02-01T01:25:38.7211355Z rm -f test-reports-*.zip 2024-02-01T01:25:38.7211946Z zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' -i '*.csv' 2024-02-01T01:25:38.7219241Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:38.7219728Z env: 2024-02-01T01:25:38.7220007Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.7220459Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.7221192Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.7222109Z FILE_SUFFIX: test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405 2024-02-01T01:25:38.7222706Z ##[endgroup] 2024-02-01T01:25:38.7301225Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-13372485e47c0487.xml (deflated 90%) 2024-02-01T01:25:38.7303046Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-3c043064e50e2c06.xml (deflated 86%) 2024-02-01T01:25:38.7304781Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-d393e3907111f09e.xml (deflated 86%) 2024-02-01T01:25:38.7306538Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-ddb948e49ef1ed3f.xml (deflated 51%) 2024-02-01T01:25:38.7386063Z adding: test/test-reports/python-pytest/dynamo.test_dynamic_shapes/dynamo.test_dynamic_shapes-8a955caba7a0d18f.xml (deflated 93%) 2024-02-01T01:25:38.7419183Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-a5bf0c8f739a7a28.xml (deflated 93%) 2024-02-01T01:25:38.7432105Z adding: test/test-reports/python-pytest/test_ops_gradients/test_ops_gradients-d1ae96f972c67cb5.xml (deflated 95%) 2024-02-01T01:25:38.7444536Z adding: test/test-reports/python-pytest/test_ops_gradients/test_ops_gradients-06ca1542eb4d534e.xml (deflated 95%) 2024-02-01T01:25:38.7457911Z adding: test/test-reports/python-pytest/test_ops_gradients/test_ops_gradients-cb42106d481e6602.xml (deflated 95%) 2024-02-01T01:25:38.7915988Z adding: test/test-reports/python-pytest/test_transformers/test_transformers-e0123797b33775f9.xml (deflated 98%) 2024-02-01T01:25:38.7923371Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-3f66acb149ddff98.xml (deflated 93%) 2024-02-01T01:25:38.7930773Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-17ad0ce98c1516fd.xml (deflated 94%) 2024-02-01T01:25:38.7936762Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-2868be99b0409e8b.xml (deflated 93%) 2024-02-01T01:25:38.7938222Z adding: test/test-reports/python-pytest/test_cuda_multigpu/test_cuda_multigpu-177ae8d77db2629e.xml (deflated 91%) 2024-02-01T01:25:38.7994343Z adding: test/test-reports/python-pytest/test_nn/test_nn-c7872bd11fb988cc.xml (deflated 97%) 2024-02-01T01:25:38.7996731Z adding: test/test-reports/python-pytest/test_public_bindings/test_public_bindings-28ed54477c67e2d3.xml (deflated 58%) 2024-02-01T01:25:38.7999360Z adding: test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-2e9b85928380fddf.xml (deflated 82%) 2024-02-01T01:25:38.8009637Z adding: test/test-reports/python-pytest/test_modules/test_modules-ce2fc7fb2fc2e1eb.xml (deflated 94%) 2024-02-01T01:25:38.8034141Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-1a0861ade760225e.xml (deflated 95%) 2024-02-01T01:25:38.8036779Z adding: test/test-reports/python-pytest/inductor.test_cuda_repro/inductor.test_cuda_repro-a7f63f4b6e4db924.xml (deflated 80%) 2024-02-01T01:25:38.8125533Z adding: test/test-reports/python-pytest/test_utils/test_utils-3882a2386c4ffd67.xml (deflated 98%) 2024-02-01T01:25:38.8128325Z adding: test/test-reports/python-pytest/inductor.test_max_autotune/inductor.test_max_autotune-78a3e3989bb7acd5.xml (deflated 85%) 2024-02-01T01:25:38.8130893Z adding: test/test-reports/python-pytest/inductor.test_cpu_repro/inductor.test_cpu_repro-c535a2c45f526169.xml (deflated 85%) 2024-02-01T01:25:38.8133337Z adding: test/test-reports/python-pytest/inductor.test_fused_attention/inductor.test_fused_attention-5e24ddf930c66dfa.xml (deflated 95%) 2024-02-01T01:25:38.8136156Z adding: test/test-reports/python-pytest/dynamo.test_modules/dynamo.test_modules-7bb724d9f75fbd0a.xml (deflated 91%) 2024-02-01T01:25:38.8141837Z adding: test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-98ac4b386e957512.xml (deflated 93%) 2024-02-01T01:25:38.8153788Z adding: test/test-reports/python-pytest/dynamo.test_minifier/dynamo.test_minifier-217a43c040a33b52.xml (deflated 95%) 2024-02-01T01:25:38.8155218Z adding: test/test-reports/python-pytest/inductor.test_inductor_freezing/inductor.test_inductor_freezing-97cd81a83f72072e.xml (deflated 83%) 2024-02-01T01:25:38.8156721Z adding: test/test-reports/python-pytest/inductor.test_triton_wrapper/inductor.test_triton_wrapper-1fbffcd882e52bb6.xml (deflated 44%) 2024-02-01T01:25:38.8158901Z adding: test/test-reports/python-pytest/export.test_export/export.test_export-b93907494fd79491.xml (deflated 85%) 2024-02-01T01:25:38.8160630Z adding: test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-b5fed57870351b2c.xml (deflated 86%) 2024-02-01T01:25:38.8162012Z adding: test/test-reports/python-pytest/inductor.test_layout_optim/inductor.test_layout_optim-ef9d61b255e718f0.xml (deflated 83%) 2024-02-01T01:25:38.8163378Z adding: test/test-reports/python-pytest/dynamo.test_unspec/dynamo.test_unspec-98e38e4f6f4d5c54.xml (deflated 81%) 2024-02-01T01:25:38.8172146Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-78746ff757999f87.xml (deflated 94%) 2024-02-01T01:25:38.8173488Z adding: test/test-reports/python-pytest/export.test_passes/export.test_passes-4993c16592176804.xml (deflated 74%) 2024-02-01T01:25:38.8174810Z adding: test/test-reports/python-pytest/dynamo.test_subclasses/dynamo.test_subclasses-ec984f207096dca0.xml (deflated 87%) 2024-02-01T01:25:38.8179967Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-b87237f8c345bae8.xml (deflated 94%) 2024-02-01T01:25:38.8181286Z adding: test/test-reports/python-pytest/dynamo.test_backends/dynamo.test_backends-6a250ae7df122803.xml (deflated 82%) 2024-02-01T01:25:38.8195943Z adding: test/test-reports/python-pytest/dynamo.test_after_aot/dynamo.test_after_aot-be3665db9f409c12.xml (deflated 48%) 2024-02-01T01:25:38.8197308Z adding: test/test-reports/python-pytest/inductor.test_config/inductor.test_config-678893ad10592ff1.xml (deflated 84%) 2024-02-01T01:25:38.8198827Z adding: test/test-reports/python-pytest/inductor.test_coordinate_descent_tuner/inductor.test_coordinate_descent_tuner-102a243b0b267a64.xml (deflated 59%) 2024-02-01T01:25:38.8223996Z adding: test/test-reports/python-pytest/test_schema_check/test_schema_check-74d1e5ad61fd973b.xml (deflated 95%) 2024-02-01T01:25:38.8228071Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-fbdef917b7493475.xml (deflated 94%) 2024-02-01T01:25:38.8229491Z adding: test/test-reports/python-pytest/test_model_exports_to_core_aten/test_model_exports_to_core_aten-2362d3bd125953ee.xml (deflated 58%) 2024-02-01T01:25:38.8230933Z adding: test/test-reports/python-pytest/dynamo.test_pre_dispatch/dynamo.test_pre_dispatch-6e5b1290c02d3a46.xml (deflated 76%) 2024-02-01T01:25:38.8232309Z adding: test/test-reports/python-pytest/export.test_safeguard/export.test_safeguard-e36c0a7019f63bec.xml (deflated 58%) 2024-02-01T01:25:38.8233944Z adding: test/test-reports/python-pytest/torch_np.numpy_tests.lib.test_type_check/torch_np.numpy_tests.lib.test_type_check-773fc089e5faaed2.xml (deflated 90%) 2024-02-01T01:25:38.8235454Z adding: test/test-reports/python-pytest/dynamo.test_comptime/dynamo.test_comptime-f0a15858a9aecf18.xml (deflated 79%) 2024-02-01T01:25:38.8237148Z adding: test/test-reports/python-pytest/nn.test_multihead_attention/nn.test_multihead_attention-71b98d1d6fec9784.xml (deflated 80%) 2024-02-01T01:25:38.8238663Z adding: test/test-reports/python-pytest/dynamo.test_cudagraphs/dynamo.test_cudagraphs-88bc0b78b6cd282f.xml (deflated 86%) 2024-02-01T01:25:38.8240138Z adding: test/test-reports/python-pytest/inductor.test_cuda_cpp_wrapper/inductor.test_cuda_cpp_wrapper-c5317fae537d2232.xml (deflated 80%) 2024-02-01T01:25:38.8262702Z ##[group]Run # Remove any previous test reports if they exist 2024-02-01T01:25:38.8263360Z # Remove any previous test reports if they exist 2024-02-01T01:25:38.8263929Z rm -f logs-*.zip 2024-02-01T01:25:38.8264622Z # this workflow is also run in bazel build test, but we dont generate usage reports for it 2024-02-01T01:25:38.8265418Z # so check to see if the file exists first 2024-02-01T01:25:38.8265927Z if [ -f 'usage_log.txt' ]; then 2024-02-01T01:25:38.8266467Z  zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt' 2024-02-01T01:25:38.8266959Z fi 2024-02-01T01:25:38.8267321Z if ls test/**/*.log 1> /dev/null 2>&1; then 2024-02-01T01:25:38.8267920Z  zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log' 2024-02-01T01:25:38.8268421Z fi 2024-02-01T01:25:38.8275752Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:38.8276238Z env: 2024-02-01T01:25:38.8276520Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:38.8276994Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:38.8277740Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:38.8278672Z FILE_SUFFIX: test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405 2024-02-01T01:25:38.8279259Z ##[endgroup] 2024-02-01T01:25:38.9915129Z adding: usage_log.txt (deflated 98%) 2024-02-01T01:25:38.9992334Z adding: test/test-reports/td_heuristic_rankings.log (deflated 82%) 2024-02-01T01:25:38.9993337Z adding: test/test-reports/inductor.test_coordinate_descent_tuner_1.1_fb67dfe4f0b2bf76_.log (deflated 66%) 2024-02-01T01:25:39.0008904Z adding: test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.3_b89fba1a042240da_.log (deflated 91%) 2024-02-01T01:25:39.0010608Z adding: test/test-reports/dynamo.test_subclasses_1.1_dad467b748106687_.log (deflated 83%) 2024-02-01T01:25:39.0051308Z adding: test/test-reports/dynamo.test_dynamic_shapes_1.1_4022e3a8f05c11da_.log (deflated 91%) 2024-02-01T01:25:39.0091454Z adding: test/test-reports/functorch.test_ops_3.6_a13071104b9c19b1_.log (deflated 92%) 2024-02-01T01:25:39.0106380Z adding: test/test-reports/test_ops_gradients_1.10_c6d65a74233a03fa_.log (deflated 92%) 2024-02-01T01:25:39.0121555Z adding: test/test-reports/test_ops_gradients_5.10_e99ab14062b89ae5_.log (deflated 92%) 2024-02-01T01:25:39.0137126Z adding: test/test-reports/test_ops_gradients_9.10_054b80417cab068c_.log (deflated 92%) 2024-02-01T01:25:39.0868043Z adding: test/test-reports/test_transformers_1.3_bf2b326bc96b49ef_.log (deflated 98%) 2024-02-01T01:25:39.0880111Z adding: test/test-reports/test_cpp_api_parity_1.1_769febd81468634d_.log (deflated 94%) 2024-02-01T01:25:39.0889920Z adding: test/test-reports/test_ops_fwd_gradients_2.10_66d64e346e9df668_.log (deflated 92%) 2024-02-01T01:25:39.0890926Z adding: test/test-reports/dynamo.test_backends_1.1_ce002c65cfb8a3fb_.log (deflated 74%) 2024-02-01T01:25:39.0900172Z adding: test/test-reports/test_ops_fwd_gradients_6.10_8a3cf8042cc3d719_.log (deflated 92%) 2024-02-01T01:25:39.0901257Z adding: test/test-reports/dynamo.test_pre_dispatch_1.1_e927df64dc23ec4f_.log (deflated 59%) 2024-02-01T01:25:39.0908667Z adding: test/test-reports/test_ops_fwd_gradients_10.10_b6b87c8eb18e0c60_.log (deflated 91%) 2024-02-01T01:25:39.0910010Z adding: test/test-reports/test_cuda_multigpu_1.1_bfbba6a84a5d5188_.log (deflated 85%) 2024-02-01T01:25:39.0975219Z adding: test/test-reports/test_nn_1.1_94ac28c60a1ac4cb_.log (deflated 95%) 2024-02-01T01:25:39.0981637Z adding: test/test-reports/nn.test_convolution_3.3_97de5273c09bd022_.log (deflated 94%) 2024-02-01T01:25:39.0982764Z adding: test/test-reports/test_public_bindings_1.1_a7609f7497d6e925_.log (deflated 59%) 2024-02-01T01:25:39.0983780Z adding: test/test-reports/export.test_safeguard_1.1_d9415d74b8a0f50e_.log (deflated 59%) 2024-02-01T01:25:39.0984757Z adding: test/test-reports/test_multiprocessing_1.1_13141a73dff7db27_.log (deflated 80%) 2024-02-01T01:25:39.1005657Z adding: test/test-reports/test_modules_1.3_c3997692a50489d0_.log (deflated 93%) 2024-02-01T01:25:39.1047599Z adding: test/test-reports/test_sparse_csr_2.3_01c9fb68a17c872f_.log (deflated 94%) 2024-02-01T01:25:39.1048576Z adding: test/test-reports/dynamo.test_comptime_1.1_2b337daaf0c9bf75_.log (deflated 68%) 2024-02-01T01:25:39.1049947Z adding: test/test-reports/inductor.test_cuda_repro_1.1_3b8cfbddec01ee0b_.log (deflated 84%) 2024-02-01T01:25:39.1163279Z adding: test/test-reports/test_utils_1.1_e1b414475baab95c_.log (deflated 96%) 2024-02-01T01:25:39.1164238Z adding: test/test-reports/lazy.test_bindings_1.1_322f384479fd708b_.log (stored 0%) 2024-02-01T01:25:39.1165446Z adding: test/test-reports/inductor.test_max_autotune_1.1_9abe6235ce9a3f04_.log (deflated 84%) 2024-02-01T01:25:39.1166494Z adding: test/test-reports/dynamo.test_cudagraphs_1.1_41f026e19d487eae_.log (deflated 69%) 2024-02-01T01:25:39.1168828Z adding: test/test-reports/inductor.test_cpu_repro_1.1_c9a55710b62dae71_.log (deflated 83%) 2024-02-01T01:25:39.1169828Z adding: test/test-reports/dynamo.test_unspec_1.1_56933a6e635a7bbb_.log (deflated 77%) 2024-02-01T01:25:39.1172174Z adding: test/test-reports/inductor.test_fused_attention_1.1_d50cfb2f1b2deb0f_.log (deflated 90%) 2024-02-01T01:25:39.1175358Z adding: test/test-reports/dynamo.test_modules_1.1_e570c46d6bddb566_.log (deflated 85%) 2024-02-01T01:25:39.1176318Z adding: test/test-reports/optim.test_optim_1.1_459db9b990c832c9_.log (deflated 7%) 2024-02-01T01:25:39.1181365Z adding: test/test-reports/dynamo.test_functions_1.1_e8e2c654a4c91fd2_.log (deflated 89%) 2024-02-01T01:25:39.1184972Z adding: test/test-reports/dynamo.test_minifier_1.1_a9472406c7688bd7_.log (deflated 93%) 2024-02-01T01:25:39.1197481Z adding: test/test-reports/test_tensor_creation_ops_1.1_c5ba8e04cbce3b99_.log (deflated 93%) 2024-02-01T01:25:39.1198934Z adding: test/test-reports/inductor.test_inductor_freezing_1.1_edd56ea19d17f41c_.log (deflated 85%) 2024-02-01T01:25:39.1200149Z adding: test/test-reports/export.test_passes_1.1_c42e545a756f66ae_.log (deflated 73%) 2024-02-01T01:25:39.1201180Z adding: test/test-reports/inductor.test_triton_wrapper_1.1_fb0c84dfb1ac3649_.log (deflated 53%) 2024-02-01T01:25:39.1203357Z adding: test/test-reports/export.test_export_1.1_09fddb13d4b6a652_.log (deflated 86%) 2024-02-01T01:25:39.1206542Z adding: test/test-reports/test_python_dispatch_1.1_85676099dbac8bdb_.log (deflated 87%) 2024-02-01T01:25:39.1207840Z adding: test/test-reports/inductor.test_layout_optim_1.1_131ec340484235f0_.log (deflated 70%) 2024-02-01T01:25:39.1209012Z adding: test/test-reports/dynamo.test_after_aot_1.1_7ace6479d266a5b1_.log (deflated 54%) 2024-02-01T01:25:39.1210010Z adding: test/test-reports/inductor.test_config_1.1_a6b528fa27962db9_.log (deflated 72%) 2024-02-01T01:25:39.1278327Z adding: test/test-reports/test_schema_check_1.2_81ca1644eb743c20_.log (deflated 95%) 2024-02-01T01:25:39.1279355Z adding: test/test-reports/test_model_exports_to_core_aten_1.1_c635f9cce0015727_.log (deflated 52%) 2024-02-01T01:25:39.1280485Z adding: test/test-reports/torch_np.numpy_tests.lib.test_type_check_1.1_3bce48a22055ed20_.log (deflated 86%) 2024-02-01T01:25:39.1281612Z adding: test/test-reports/nn.test_multihead_attention_1.1_126017fe4dfd4677_.log (deflated 83%) 2024-02-01T01:25:39.1284534Z adding: test/test-reports/inductor.test_cuda_cpp_wrapper_1.3_7c656ded6810c4c5_.log (deflated 92%) 2024-02-01T01:25:39.1285889Z adding: test/test-reports/inductor.test_aot_inductor_utils_1.1_8d8a5ef9f0a31ee2_.log (stored 0%) 2024-02-01T01:25:39.1337560Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-02-01T01:25:39.1338055Z with: 2024-02-01T01:25:39.1338511Z s3-prefix: pytorch/pytorch/7731552918/2/artifact 2024-02-01T01:25:39.1338980Z retention-days: 14 2024-02-01T01:25:39.1339314Z if-no-files-found: warn 2024-02-01T01:25:39.1339663Z path: test-jsons-*.zip 2024-02-01T01:25:39.1340005Z name: artifact 2024-02-01T01:25:39.1340317Z s3-bucket: gha-artifacts 2024-02-01T01:25:39.1340664Z region: us-east-1 2024-02-01T01:25:39.1340952Z env: 2024-02-01T01:25:39.1341307Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:39.1341873Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:39.1342615Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:39.1343244Z ##[endgroup] 2024-02-01T01:25:39.4486454Z NOTE: s3-prefix specified, ignoring name parameter 2024-02-01T01:25:39.4487044Z With the provided path, there will be 1 file uploaded 2024-02-01T01:25:39.4487661Z Uploading to s3 prefix: pytorch/pytorch/7731552918/2/artifact 2024-02-01T01:25:39.4512924Z Starting upload of test-jsons-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405.zip 2024-02-01T01:25:39.5676492Z Finished upload of test-jsons-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405.zip 2024-02-01T01:25:39.5839969Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-02-01T01:25:39.5840414Z with: 2024-02-01T01:25:39.5840761Z s3-prefix: pytorch/pytorch/7731552918/2/artifact 2024-02-01T01:25:39.5841239Z retention-days: 14 2024-02-01T01:25:39.5841585Z if-no-files-found: error 2024-02-01T01:25:39.5841956Z path: test-reports-*.zip 2024-02-01T01:25:39.5842431Z name: artifact 2024-02-01T01:25:39.5842746Z s3-bucket: gha-artifacts 2024-02-01T01:25:39.5843091Z region: us-east-1 2024-02-01T01:25:39.5843389Z env: 2024-02-01T01:25:39.5843663Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:39.5844114Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:39.5845145Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:39.5845788Z ##[endgroup] 2024-02-01T01:25:39.8893350Z NOTE: s3-prefix specified, ignoring name parameter 2024-02-01T01:25:39.8894103Z With the provided path, there will be 1 file uploaded 2024-02-01T01:25:39.8894968Z Uploading to s3 prefix: pytorch/pytorch/7731552918/2/artifact 2024-02-01T01:25:39.8920009Z Starting upload of test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405.zip 2024-02-01T01:25:40.0648398Z Finished upload of test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405.zip 2024-02-01T01:25:40.0784077Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-02-01T01:25:40.0784537Z with: 2024-02-01T01:25:40.0784884Z s3-prefix: pytorch/pytorch/7731552918/2/artifact 2024-02-01T01:25:40.0785357Z retention-days: 14 2024-02-01T01:25:40.0785688Z if-no-files-found: ignore 2024-02-01T01:25:40.0786049Z path: logs-*.zip 2024-02-01T01:25:40.0786353Z name: artifact 2024-02-01T01:25:40.0786657Z s3-bucket: gha-artifacts 2024-02-01T01:25:40.0787003Z region: us-east-1 2024-02-01T01:25:40.0787298Z env: 2024-02-01T01:25:40.0787575Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:40.0788026Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:40.0788762Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:40.0789386Z ##[endgroup] 2024-02-01T01:25:40.3857881Z NOTE: s3-prefix specified, ignoring name parameter 2024-02-01T01:25:40.3858547Z With the provided path, there will be 1 file uploaded 2024-02-01T01:25:40.3859183Z Uploading to s3 prefix: pytorch/pytorch/7731552918/2/artifact 2024-02-01T01:25:40.3883561Z Starting upload of logs-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405.zip 2024-02-01T01:25:40.5948980Z Finished upload of logs-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_21083466405.zip 2024-02-01T01:25:40.6082861Z ##[group]Run # shellcheck disable=SC2156 2024-02-01T01:25:40.6083354Z # shellcheck disable=SC2156 2024-02-01T01:25:40.6084146Z find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2024-02-01T01:25:40.6092485Z shell: /usr/bin/bash -e {0} 2024-02-01T01:25:40.6092829Z env: 2024-02-01T01:25:40.6093106Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:40.6093561Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:40.6094287Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:40.6094921Z ##[endgroup] 2024-02-01T01:25:40.7691814Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-02-01T01:25:40.7692268Z with: 2024-02-01T01:25:40.7692658Z name: coredumps-default-1-4-linux.g5.4xlarge.nvidia.gpu 2024-02-01T01:25:40.7693227Z retention-days: 14 2024-02-01T01:25:40.7693570Z if-no-files-found: ignore 2024-02-01T01:25:40.7693923Z path: ./**/core.[1-9]* 2024-02-01T01:25:40.7694269Z s3-bucket: gha-artifacts 2024-02-01T01:25:40.7694615Z region: us-east-1 2024-02-01T01:25:40.7694900Z env: 2024-02-01T01:25:40.7695175Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:40.7695622Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:40.7696354Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:40.7696989Z ##[endgroup] 2024-02-01T01:25:48.2732770Z No files were found with the provided path: ./**/core.[1-9]*. No artifacts will be uploaded. 2024-02-01T01:25:48.2947930Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2024-02-01T01:25:48.2948486Z with: 2024-02-01T01:25:48.2948740Z env: 2024-02-01T01:25:48.2949022Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:48.2949481Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:48.2950231Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:48.2950864Z ##[endgroup] 2024-02-01T01:25:48.2972217Z ##[group]Run set -eou pipefail 2024-02-01T01:25:48.2972628Z set -eou pipefail 2024-02-01T01:25:48.2972973Z  2024-02-01T01:25:48.2973492Z echo "Holding runner for 2 hours until all ssh sessions have logged out" 2024-02-01T01:25:48.2974144Z for _ in $(seq 1440); do 2024-02-01T01:25:48.2974620Z  # Break if no ssh session exists anymore 2024-02-01T01:25:48.2975172Z  if [ "$(who)" = "" ]; then 2024-02-01T01:25:48.2975574Z  break 2024-02-01T01:25:48.2975909Z  fi 2024-02-01T01:25:48.2976204Z  echo "." 2024-02-01T01:25:48.2976535Z  sleep 5 2024-02-01T01:25:48.2976860Z done 2024-02-01T01:25:48.2984492Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:48.2985027Z env: 2024-02-01T01:25:48.2985331Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:48.2985783Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:48.2986532Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:48.2987172Z ##[endgroup] 2024-02-01T01:25:48.3009885Z Holding runner for 2 hours until all ssh sessions have logged out 2024-02-01T01:25:48.3045604Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2024-02-01T01:25:48.3046467Z # ignore expansion of "docker ps -q" since it could be empty 2024-02-01T01:25:48.3047057Z # shellcheck disable=SC2046 2024-02-01T01:25:48.3047532Z docker stop $(docker ps -q) || true 2024-02-01T01:25:48.3048026Z # Prune all of the docker images 2024-02-01T01:25:48.3048481Z docker system prune -af 2024-02-01T01:25:48.3055351Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:48.3055862Z env: 2024-02-01T01:25:48.3056145Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:48.3056607Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:48.3057368Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:48.3058011Z ##[endgroup] 2024-02-01T01:25:48.8470164Z 9c1b86787841 2024-02-01T01:25:49.5766100Z Deleted Containers: 2024-02-01T01:25:49.5766915Z 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:49.5767347Z 2024-02-01T01:25:53.6816580Z Deleted Images: 2024-02-01T01:25:53.6818255Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9:3266cf2f0fa6c50bf5453a6ccbc1904de753c35f 2024-02-01T01:25:53.6820673Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9@sha256:2b2cf3c3f6d85ed39aac322cdfac02285cd0a07904c8f9184f94a19e3a0e9d49 2024-02-01T01:25:53.6822543Z deleted: sha256:7d78b20012855630db29f7726cc638698c783297a18547e4ffbae45c1da59435 2024-02-01T01:25:53.6823383Z deleted: sha256:cd0e35e521deed8630d2819afa0cb6174ff3faeef71eedb416016c3bedb24c06 2024-02-01T01:25:53.6824260Z deleted: sha256:bca9fccd967fe8bae87ce01403f4da4df8891df9579763477deddf77e56e1f1d 2024-02-01T01:25:53.6825115Z deleted: sha256:c8789e1140e1286545e662f076bf7b7efcfec96df46ec79af4456de744dd9cf9 2024-02-01T01:25:53.6825951Z deleted: sha256:5d8fa8c9e13b4ea92fc665fd60d44e5fe792aae06961ece98ff36b8399a65a1a 2024-02-01T01:25:53.6826849Z deleted: sha256:de093ce8543519503d884116798f65080107daeba9a3b1a5ecf24ac21db90b03 2024-02-01T01:25:53.6827668Z deleted: sha256:3d476f27e8687eea57f71c8e65ecbb165ac1e56f077f41cc2268f1a964142071 2024-02-01T01:25:53.6828621Z deleted: sha256:b7eef3a1c663af1c9949d67b22bc4eef07e252714beb45b8939befec94b00bdc 2024-02-01T01:25:53.6829470Z deleted: sha256:db33ed7ceb46882902c457aa6aaeb4576ff3099023a68c9d14adc3bd2497744c 2024-02-01T01:25:53.6830311Z deleted: sha256:e6994938eaeb9774b899d4af0e3ccef7ba1417a1e463fb2eeb34111c130c4892 2024-02-01T01:25:53.6831158Z deleted: sha256:9ae9e08bd0fde58c78fdb78b8bd53bee2c9bc629430dad0bddb49b8e278c4ce0 2024-02-01T01:25:53.6832019Z deleted: sha256:b742348bc3ac9bde35ca9bc3b47e3ebb0f56f686cdbe3b9bae978bbff7bb8e04 2024-02-01T01:25:53.6832860Z deleted: sha256:04bf91227ae62e90cc461bde082395b791233577a039878b25c3813643c06dfc 2024-02-01T01:25:53.6833656Z deleted: sha256:e6a5132367b471556036c12c8654c80319048040210bd5ce83ab7ee389a94529 2024-02-01T01:25:53.6834453Z deleted: sha256:aca69f1c663b41a4989ec34f20a5c1278c4804a712d5289de0b17a9663dd165c 2024-02-01T01:25:53.6835294Z deleted: sha256:661a99bd59cab0cd71f2db72b1adced764549a8b6b9776c81727eccc734e1d0f 2024-02-01T01:25:53.6836132Z deleted: sha256:6328908f21b7449ce2b0d83e2bb654339c299440602f991b6faf4c333ffca2e6 2024-02-01T01:25:53.6837007Z deleted: sha256:5adeba195d2d3a307dc7467fa9d0307a62b35e06efcf79ce62c5407f6917d27c 2024-02-01T01:25:53.6837844Z deleted: sha256:51e1f03ab3aa056ee10a080aa304a11365361d79e92d10b3fbc3386d590ad88e 2024-02-01T01:25:53.6838704Z deleted: sha256:dbdbaf34eb88aead2291d5bf7c43a0123fe9d15ed70d72fd584eaf3cd817ab34 2024-02-01T01:25:53.6839576Z deleted: sha256:254ade188523ae45b1bb0bc8064ce0fa8ed4d3a59c49698d3ba4ffe74e47c456 2024-02-01T01:25:53.6840404Z deleted: sha256:1372066c99d30e6de02aaa93ff5500349afd1fa7b44cb1196e631fb80e9acc9b 2024-02-01T01:25:53.6841237Z deleted: sha256:7474e826f80cde25faa0ce44e8ee38fcf0542d79ced0d022e72f83b160ac371b 2024-02-01T01:25:53.6842078Z deleted: sha256:3be59ed0b2b93a2af733de200fa502c98588b8984dd68372db4faf26dbbf53c3 2024-02-01T01:25:53.6842924Z deleted: sha256:7a4b8e16b24799cac6c789e2f6d703366f0b3d602de877c488c7837372280f4e 2024-02-01T01:25:53.6843745Z deleted: sha256:018c7bce039b2117e7608f0c4df06f60a8523ed065c8b10a35586200bc437fec 2024-02-01T01:25:53.6844589Z deleted: sha256:6f6188c75ddfe7b3eca5415b841bc0f314ed56824f13efad4c3bf89703e793f2 2024-02-01T01:25:53.6845624Z deleted: sha256:4182034dcf2ef7c766bd14212197fe13f009d64513ff61087d6adb9fc9a18bfc 2024-02-01T01:25:53.6846439Z deleted: sha256:d0974bd744a402c041cafe3dc7be2c6db7f97216f8b03e817f3ff321e4b26f09 2024-02-01T01:25:53.6847271Z deleted: sha256:aa39709946274ff34b09102f003e295a94f8d7d58dc310a9873cf93106ad4a8e 2024-02-01T01:25:53.6848094Z deleted: sha256:b74650ca1b0d30dbbcaa199b592cac002ed6a0b8e34235814d638d3f9baac0d4 2024-02-01T01:25:53.6848922Z deleted: sha256:e0f2d0ebaa90f9575b54ec2964c8c376a734ace79869925f005ce94584046a38 2024-02-01T01:25:53.6849840Z deleted: sha256:3610fb3d7efd05aa3db95cb27519c3c9416481c28d54a251a524daf0cf09593d 2024-02-01T01:25:53.6850759Z deleted: sha256:ea666f6710c8f3acdce6c4ff9a191121be3c8d9b34032c99499d740d4b58c6e8 2024-02-01T01:25:53.6851719Z deleted: sha256:c948f3da8b1c6f7e281e475862534875c54c6fdf420fc67c6037e29702f615f5 2024-02-01T01:25:53.6852551Z deleted: sha256:526e41e6f6432bb697b04dbd9e2b0da8711a70e8c13a148709c162063c910590 2024-02-01T01:25:53.6853391Z deleted: sha256:f0a6ad19d9bb477b908b1d3f098dca4f10bf94e6b73abd0dced079e441d4cf59 2024-02-01T01:25:53.6854294Z deleted: sha256:c0c4f0f7e025d8aaae32f99a7b2f38f84c963b4f6250472b22f56a4298edd169 2024-02-01T01:25:53.6855122Z deleted: sha256:8adf212436eb6c147db8eba127b7d0f41ce2dd4868bf8ce376de2ea0926e05ef 2024-02-01T01:25:53.6855967Z deleted: sha256:eb5b2c0ed98df39ed3de0001403f42a83f4f492dde3e6dc1667f3fcb87919bd8 2024-02-01T01:25:53.6856800Z deleted: sha256:aa109b9fb1c2f9fb9ce1820f66138015b8ab2e4d230cb93f6337010fd684a581 2024-02-01T01:25:53.6857645Z deleted: sha256:3b181910732aad95991a265035a7faf20132fda4ec19db0403c71872f4e5b4f4 2024-02-01T01:25:53.6858468Z deleted: sha256:5a4290dcdbecc1630d13655b730caa0a419ef1be615122b83a4cf107bdbba24c 2024-02-01T01:25:53.6859296Z deleted: sha256:0f5ff8083fd518d2910711e480bf4aff921d780febb0c9646c1e81137236018a 2024-02-01T01:25:53.6860201Z deleted: sha256:901392d475c89184993caca719639cc363b71755ffb89290f17a108cd0f17ab8 2024-02-01T01:25:53.6861010Z deleted: sha256:4dff2ef42c740ad6a0eed8fd741e8d494649442737634f0a41987fa18ac699fc 2024-02-01T01:25:53.6861846Z deleted: sha256:4b2bdceedc91d6275b439d421aaf3e330f337fdd80fce5c97b11517d63793111 2024-02-01T01:25:53.6862672Z deleted: sha256:9d909d9de7585e2fd91e260769ed81dca04b5559b08d189e4e21a9fece7e2975 2024-02-01T01:25:53.6863493Z deleted: sha256:8bf5ad718ded6ce82d0c3e592071cfd828bf94a0affc53433e8f18e77845ad12 2024-02-01T01:25:53.6864329Z deleted: sha256:33009a5e45ace6d3dc0af80d2c8522932421b329b4f398fd9fcc106b005a9de2 2024-02-01T01:25:53.6865148Z deleted: sha256:23a2f723987dc24771e90f6e5cd8283a1bca95f1f3a33258dc293e016e9d7bfc 2024-02-01T01:25:53.6866067Z deleted: sha256:f3ce25576527becf31eb84ca197bfcf38cda5a6829aba7eb46d86ad36a33ce56 2024-02-01T01:25:53.6866976Z deleted: sha256:37b67a4715f1dc33fb69fadddd08e643cbcbe2daab6cdfaa63e1116a023aad45 2024-02-01T01:25:53.6867836Z deleted: sha256:8228b31c5be3d2449cfbb364c582b5ad55a01c5104904bb4ae7abf84bc1507c3 2024-02-01T01:25:53.6868678Z deleted: sha256:d7528e72ede7076a3815a15990fc9bbf25cb73830ada976e1536a2d356d6c0d9 2024-02-01T01:25:53.6869516Z deleted: sha256:640c0b4b2a16a19164649a8d8ab54ec850b7e83efa01b3779bdcdb21f1322be2 2024-02-01T01:25:53.6870334Z deleted: sha256:3522f074f49ae1d0c88fb6efe59a56b18dccfbab319c41b3bf35c6131c636b2c 2024-02-01T01:25:53.6871175Z deleted: sha256:c69d336c79011a57e165dd07fdd7b9fb5c36d8446c62e7c8771f32acd734b120 2024-02-01T01:25:53.6872009Z deleted: sha256:646a0bce958d9b0506511eb3d929395bdf8e3f6413acb1a087fdde0d35729b4c 2024-02-01T01:25:53.6872828Z deleted: sha256:0f6c82988ea842e01d9129e144c5d680e40a5aee1e265e6d32f81d1be008a453 2024-02-01T01:25:53.6873662Z deleted: sha256:29905530a0ac3f9b240f93121533ee407ed22c45b464a939228acc5625531f34 2024-02-01T01:25:53.6874497Z deleted: sha256:64339972be8e52f326de0da5215cbdb725bf91cce68edad1ef12ef7ba1173999 2024-02-01T01:25:53.6875323Z deleted: sha256:35a033279d5c9bd6b862bfa9331d8fcbf98bbaf778a778fabc882384db9204f8 2024-02-01T01:25:53.6876169Z deleted: sha256:ab9b00c496073d62e69e032178858d3d6d4c4bab9e87065d3785c6da57351d00 2024-02-01T01:25:53.6876990Z deleted: sha256:7ed947299e109c2459de6e240d86f049c30f93f2380659ce441d8737b8cd065f 2024-02-01T01:25:53.6877816Z deleted: sha256:410d5a7f7a9a1cb4551c106b9cc728a9dcff598f9be231a351ce6a0a33f81e64 2024-02-01T01:25:53.6878634Z deleted: sha256:5f0021bb56efa14bb93978c01513e2a1187ba30e69bdb0546d0ef39b30873f88 2024-02-01T01:25:53.6879475Z deleted: sha256:c1601aa97eb84151c14da3aeea351201bd99144d36d66b397f6555d80245d86d 2024-02-01T01:25:53.6880321Z deleted: sha256:72af05a89be22accdc1ca5d66dcbbb33993a9ef5997f849df1d4ba4c48049f25 2024-02-01T01:25:53.6881156Z deleted: sha256:38b90c9663dcaa2bc57a4dd3008298e7ea93e9535fa42312b4dc4246e7491af9 2024-02-01T01:25:53.6882060Z deleted: sha256:bda61b6cefb3ec8eeb74fef1ca1c7f9a5845fe5e8f07b8123323e897425d5c29 2024-02-01T01:25:53.6882904Z deleted: sha256:5a18e1aa877074529a84cbddf19f8d5403787823378ceae6b72fb62f78d43037 2024-02-01T01:25:53.6883732Z deleted: sha256:6c3e7df31590f02f10cb71fc4eb27653e9b428df2e6e5421a455b062bd2e39f9 2024-02-01T01:25:53.6884240Z 2024-02-01T01:25:53.6884397Z Total reclaimed space: 24.12GB 2024-02-01T01:25:53.6917593Z ##[group]Run set +e 2024-02-01T01:25:53.6917988Z set +e 2024-02-01T01:25:53.6918291Z set -x 2024-02-01T01:25:53.6918579Z  2024-02-01T01:25:53.6918869Z nvidia-smi 2024-02-01T01:25:53.6919490Z # NB: Surprisingly, nvidia-smi command returns successfully with return code 0 even in 2024-02-01T01:25:53.6920464Z # the case where the driver has already crashed as it still can get the driver version 2024-02-01T01:25:53.6921441Z # and some basic information like the bus ID. However, the rest of the information 2024-02-01T01:25:53.6922202Z # would be missing (ERR!), for example: 2024-02-01T01:25:53.6922649Z # 2024-02-01T01:25:53.6923071Z # +-----------------------------------------------------------------------------+ 2024-02-01T01:25:53.6923843Z # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | 2024-02-01T01:25:53.6924625Z # |-------------------------------+----------------------+----------------------+ 2024-02-01T01:25:53.6925656Z # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 2024-02-01T01:25:53.6926509Z # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 2024-02-01T01:25:53.6927250Z # | | | MIG M. | 2024-02-01T01:25:53.6927849Z # |===============================+======================+======================| 2024-02-01T01:25:53.6928491Z # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | 2024-02-01T01:25:53.6929319Z # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | 2024-02-01T01:25:53.6930017Z # | | | ERR! | 2024-02-01T01:25:53.6930642Z # +-------------------------------+----------------------+----------------------+ 2024-02-01T01:25:53.6931144Z # 2024-02-01T01:25:53.6931555Z # +-----------------------------------------------------------------------------+ 2024-02-01T01:25:53.6932210Z # | Processes: | 2024-02-01T01:25:53.6932934Z # | GPU GI CI PID Type Process name GPU Memory | 2024-02-01T01:25:53.6933639Z # | ID ID Usage | 2024-02-01T01:25:53.6934252Z # |=============================================================================| 2024-02-01T01:25:53.6934865Z # +-----------------------------------------------------------------------------+ 2024-02-01T01:25:53.6935366Z # 2024-02-01T01:25:53.6935919Z # This should be reported as a failure instead as it will guarantee to fail when 2024-02-01T01:25:53.6936656Z # Docker tries to run with --gpus all 2024-02-01T01:25:53.6937095Z # 2024-02-01T01:25:53.6937632Z # So, the correct check here is to query one of the missing piece of info like 2024-02-01T01:25:53.6938408Z # GPU name, so that the command can fail accordingly 2024-02-01T01:25:53.6939104Z nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2024-02-01T01:25:53.6939674Z NVIDIA_SMI_STATUS=$? 2024-02-01T01:25:53.6940040Z  2024-02-01T01:25:53.6940660Z # These are acceptable return code from nvidia-smi as copied from setup-nvidia GitHub action 2024-02-01T01:25:53.6941659Z if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then 2024-02-01T01:25:53.6942489Z  echo "NVIDIA driver installation has failed, shutting down the runner..." 2024-02-01T01:25:53.6943204Z  .github/scripts/stop_runner_service.sh 2024-02-01T01:25:53.6943667Z fi 2024-02-01T01:25:53.6943940Z  2024-02-01T01:25:53.6944628Z # For runner with multiple GPUs, we also want to confirm that the number of GPUs are the 2024-02-01T01:25:53.6945609Z # power of 2, i.e. 1, 2, 4, or 8. This is to avoid flaky test issue when one GPU fails 2024-02-01T01:25:53.6946412Z # https://github.com/pytorch/test-infra/issues/4000 2024-02-01T01:25:53.6947069Z GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) 2024-02-01T01:25:53.6947554Z NVIDIA_SMI_STATUS=$? 2024-02-01T01:25:53.6947909Z  2024-02-01T01:25:53.6948522Z # These are acceptable return code from nvidia-smi as copied from setup-nvidia GitHub action 2024-02-01T01:25:53.6949452Z if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then 2024-02-01T01:25:53.6950287Z  echo "NVIDIA driver installation has failed, shutting down the runner..." 2024-02-01T01:25:53.6950991Z  .github/scripts/stop_runner_service.sh 2024-02-01T01:25:53.6951446Z fi 2024-02-01T01:25:53.6951738Z  2024-02-01T01:25:53.6952089Z # Check the GPU count to be a power of 2 2024-02-01T01:25:53.6952937Z if [ "$GPU_COUNT" -le 8 ] && [ "$GPU_COUNT" -ne 1 ] && [ "$GPU_COUNT" -ne 2 ] && [ "$GPU_COUNT" -ne 4 ] && [ "$GPU_COUNT" -ne 8 ]; then 2024-02-01T01:25:53.6954046Z  echo "NVIDIA driver detects $GPU_COUNT GPUs. The runner has a broken GPU, shutting it down..." 2024-02-01T01:25:53.6954843Z  .github/scripts/stop_runner_service.sh 2024-02-01T01:25:53.6955296Z fi 2024-02-01T01:25:53.6962997Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-02-01T01:25:53.6963508Z env: 2024-02-01T01:25:53.6963779Z GIT_DEFAULT_BRANCH: main 2024-02-01T01:25:53.6964244Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-02-01T01:25:53.6965169Z DOCKER_CONTAINER_ID: 9c1b867878419d712fe79c0f86206bccddfb17ab4de35d11084a62e84e0a9dbd 2024-02-01T01:25:53.6965957Z RUNNER_WORKSPACE: /home/ec2-user/actions-runner/_work/pytorch 2024-02-01T01:25:53.6966477Z ##[endgroup] 2024-02-01T01:25:53.6987206Z + nvidia-smi 2024-02-01T01:25:55.8660367Z Thu Feb 1 01:25:55 2024 2024-02-01T01:25:55.8661315Z +---------------------------------------------------------------------------------------+ 2024-02-01T01:25:55.8662309Z | NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 | 2024-02-01T01:25:55.8663157Z |-----------------------------------------+----------------------+----------------------+ 2024-02-01T01:25:55.8663943Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2024-02-01T01:25:55.8664826Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2024-02-01T01:25:55.8665507Z | | | MIG M. | 2024-02-01T01:25:55.8666035Z |=========================================+======================+======================| 2024-02-01T01:25:55.8727871Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 | 2024-02-01T01:25:55.8728587Z | 0% 22C P0 52W / 300W | 4MiB / 23028MiB | 5% Default | 2024-02-01T01:25:55.8729206Z | | | N/A | 2024-02-01T01:25:55.8729913Z +-----------------------------------------+----------------------+----------------------+ 2024-02-01T01:25:55.8730585Z 2024-02-01T01:25:55.8731450Z +---------------------------------------------------------------------------------------+ 2024-02-01T01:25:55.8732059Z | Processes: | 2024-02-01T01:25:55.8732753Z | GPU GI CI PID Type Process name GPU Memory | 2024-02-01T01:25:55.8733433Z | ID ID Usage | 2024-02-01T01:25:55.8734142Z |=======================================================================================| 2024-02-01T01:25:55.8734790Z | No running processes found | 2024-02-01T01:25:55.8735531Z +---------------------------------------------------------------------------------------+ 2024-02-01T01:25:56.3622252Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2024-02-01T01:25:58.5091897Z NVIDIA A10G 2024-02-01T01:25:58.8205556Z + NVIDIA_SMI_STATUS=0 2024-02-01T01:25:58.8206131Z + '[' 0 -ne 0 ']' 2024-02-01T01:25:58.8209582Z ++ nvidia-smi --list-gpus 2024-02-01T01:25:58.8210095Z ++ wc -l 2024-02-01T01:26:01.2909020Z + GPU_COUNT=1 2024-02-01T01:26:01.2909505Z + NVIDIA_SMI_STATUS=0 2024-02-01T01:26:01.2910341Z + '[' 0 -ne 0 ']' 2024-02-01T01:26:01.2910693Z + '[' 1 -le 8 ']' 2024-02-01T01:26:01.2911009Z + '[' 1 -ne 1 ']' 2024-02-01T01:26:01.2967132Z Post job cleanup. 2024-02-01T01:26:01.3015027Z Post job cleanup. 2024-02-01T01:26:01.3828104Z [command]/usr/bin/git version 2024-02-01T01:26:01.3865925Z git version 2.40.1 2024-02-01T01:26:01.3904396Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e0afa28d-bc3a-48b9-8949-6fd62eb61d3c' before making global git config changes 2024-02-01T01:26:01.3905640Z Adding repository directory to the temporary git global config as a safe directory 2024-02-01T01:26:01.3909750Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-02-01T01:26:01.3944646Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2024-02-01T01:26:01.3974628Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2024-02-01T01:26:01.4174667Z Entering 'android/libs/fbjni' 2024-02-01T01:26:01.4210583Z Entering 'third_party/FP16' 2024-02-01T01:26:01.4248889Z Entering 'third_party/FXdiv' 2024-02-01T01:26:01.4285661Z Entering 'third_party/NNPACK' 2024-02-01T01:26:01.4319918Z Entering 'third_party/QNNPACK' 2024-02-01T01:26:01.4356286Z Entering 'third_party/VulkanMemoryAllocator' 2024-02-01T01:26:01.4391553Z Entering 'third_party/XNNPACK' 2024-02-01T01:26:01.4437206Z Entering 'third_party/benchmark' 2024-02-01T01:26:01.4472918Z Entering 'third_party/cpuinfo' 2024-02-01T01:26:01.4508342Z Entering 'third_party/cub' 2024-02-01T01:26:01.4544597Z Entering 'third_party/cudnn_frontend' 2024-02-01T01:26:01.4583180Z Entering 'third_party/cutlass' 2024-02-01T01:26:01.4623421Z Entering 'third_party/eigen' 2024-02-01T01:26:01.4661986Z Entering 'third_party/fbgemm' 2024-02-01T01:26:01.4697580Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-02-01T01:26:01.4733492Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-02-01T01:26:01.4769896Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-02-01T01:26:01.4806850Z Entering 'third_party/fbgemm/third_party/googletest' 2024-02-01T01:26:01.4841071Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-02-01T01:26:01.4877733Z Entering 'third_party/flatbuffers' 2024-02-01T01:26:01.4915767Z Entering 'third_party/fmt' 2024-02-01T01:26:01.4950927Z Entering 'third_party/foxi' 2024-02-01T01:26:01.4984561Z Entering 'third_party/gemmlowp/gemmlowp' 2024-02-01T01:26:01.5020110Z Entering 'third_party/gloo' 2024-02-01T01:26:01.5056207Z Entering 'third_party/googletest' 2024-02-01T01:26:01.5092394Z Entering 'third_party/ideep' 2024-02-01T01:26:01.5124598Z Entering 'third_party/ideep/mkl-dnn' 2024-02-01T01:26:01.5163537Z Entering 'third_party/ios-cmake' 2024-02-01T01:26:01.5200052Z Entering 'third_party/ittapi' 2024-02-01T01:26:01.5236155Z Entering 'third_party/kineto' 2024-02-01T01:26:01.5271790Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-02-01T01:26:01.5305153Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-02-01T01:26:01.5341517Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-02-01T01:26:01.5377145Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-02-01T01:26:01.5411251Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-02-01T01:26:01.5443858Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-02-01T01:26:01.5480912Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-02-01T01:26:01.5514874Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-02-01T01:26:01.5550546Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-02-01T01:26:01.5584480Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-02-01T01:26:01.5619358Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-02-01T01:26:01.5655926Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-02-01T01:26:01.5689327Z Entering 'third_party/mimalloc' 2024-02-01T01:26:01.5725987Z Entering 'third_party/nccl/nccl' 2024-02-01T01:26:01.5760469Z Entering 'third_party/neon2sse' 2024-02-01T01:26:01.5795402Z Entering 'third_party/nlohmann' 2024-02-01T01:26:01.5833044Z Entering 'third_party/onnx' 2024-02-01T01:26:01.5880695Z Entering 'third_party/onnx/third_party/benchmark' 2024-02-01T01:26:01.5916488Z Entering 'third_party/onnx/third_party/pybind11' 2024-02-01T01:26:01.5955074Z Entering 'third_party/onnx-tensorrt' 2024-02-01T01:26:01.5987620Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-02-01T01:26:01.6026162Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-02-01T01:26:01.6060002Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-02-01T01:26:01.6093551Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-02-01T01:26:01.6130856Z Entering 'third_party/pocketfft' 2024-02-01T01:26:01.6163725Z Entering 'third_party/protobuf' 2024-02-01T01:26:01.6199885Z Entering 'third_party/protobuf/third_party/benchmark' 2024-02-01T01:26:01.6234038Z Entering 'third_party/protobuf/third_party/googletest' 2024-02-01T01:26:01.6271319Z Entering 'third_party/psimd' 2024-02-01T01:26:01.6304699Z Entering 'third_party/pthreadpool' 2024-02-01T01:26:01.6340753Z Entering 'third_party/pybind11' 2024-02-01T01:26:01.6372637Z Entering 'third_party/python-peachpy' 2024-02-01T01:26:01.6407885Z Entering 'third_party/sleef' 2024-02-01T01:26:01.6441428Z Entering 'third_party/tbb' 2024-02-01T01:26:01.6477760Z Entering 'third_party/tensorpipe' 2024-02-01T01:26:01.6514055Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-02-01T01:26:01.6549733Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-02-01T01:26:01.6582957Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-02-01T01:26:01.6617542Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-02-01T01:26:01.6650112Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-02-01T01:26:01.6684077Z Entering 'third_party/zstd' 2024-02-01T01:26:01.6733887Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2024-02-01T01:26:01.6752521Z http.https://github.com/.extraheader 2024-02-01T01:26:01.6760253Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2024-02-01T01:26:01.6788981Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2024-02-01T01:26:01.6995461Z Entering 'android/libs/fbjni' 2024-02-01T01:26:01.7019386Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7042662Z Entering 'third_party/FP16' 2024-02-01T01:26:01.7064815Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7088842Z Entering 'third_party/FXdiv' 2024-02-01T01:26:01.7110731Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7134156Z Entering 'third_party/NNPACK' 2024-02-01T01:26:01.7154200Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7179800Z Entering 'third_party/QNNPACK' 2024-02-01T01:26:01.7201217Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7224966Z Entering 'third_party/VulkanMemoryAllocator' 2024-02-01T01:26:01.7247149Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7271579Z Entering 'third_party/XNNPACK' 2024-02-01T01:26:01.7293309Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7327888Z Entering 'third_party/benchmark' 2024-02-01T01:26:01.7349776Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7372869Z Entering 'third_party/cpuinfo' 2024-02-01T01:26:01.7394451Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7418889Z Entering 'third_party/cub' 2024-02-01T01:26:01.7440418Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7464035Z Entering 'third_party/cudnn_frontend' 2024-02-01T01:26:01.7485786Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7510121Z Entering 'third_party/cutlass' 2024-02-01T01:26:01.7530856Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7558643Z Entering 'third_party/eigen' 2024-02-01T01:26:01.7581108Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7607040Z Entering 'third_party/fbgemm' 2024-02-01T01:26:01.7629977Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7654008Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-02-01T01:26:01.7674998Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7699217Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-02-01T01:26:01.7719816Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7743466Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-02-01T01:26:01.7765375Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7792250Z Entering 'third_party/fbgemm/third_party/googletest' 2024-02-01T01:26:01.7813343Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7836492Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-02-01T01:26:01.7859154Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7883526Z Entering 'third_party/flatbuffers' 2024-02-01T01:26:01.7906275Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7933613Z Entering 'third_party/fmt' 2024-02-01T01:26:01.7953846Z http.https://github.com/.extraheader 2024-02-01T01:26:01.7979257Z Entering 'third_party/foxi' 2024-02-01T01:26:01.8001782Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8025952Z Entering 'third_party/gemmlowp/gemmlowp' 2024-02-01T01:26:01.8048630Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8072349Z Entering 'third_party/gloo' 2024-02-01T01:26:01.8094597Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8118574Z Entering 'third_party/googletest' 2024-02-01T01:26:01.8142818Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8166510Z Entering 'third_party/ideep' 2024-02-01T01:26:01.8187987Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8210877Z Entering 'third_party/ideep/mkl-dnn' 2024-02-01T01:26:01.8232079Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8259416Z Entering 'third_party/ios-cmake' 2024-02-01T01:26:01.8281623Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8304121Z Entering 'third_party/ittapi' 2024-02-01T01:26:01.8323924Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8348775Z Entering 'third_party/kineto' 2024-02-01T01:26:01.8370815Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8394158Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-02-01T01:26:01.8416350Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8440115Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-02-01T01:26:01.8462673Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8486215Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-02-01T01:26:01.8508089Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8530463Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-02-01T01:26:01.8552117Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8576384Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-02-01T01:26:01.8597892Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8622357Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-02-01T01:26:01.8643828Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8669414Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-02-01T01:26:01.8690639Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8714417Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-02-01T01:26:01.8736932Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8761178Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-02-01T01:26:01.8783868Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8810553Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-02-01T01:26:01.8831322Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8857872Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-02-01T01:26:01.8880871Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8905435Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-02-01T01:26:01.8926037Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8950801Z Entering 'third_party/mimalloc' 2024-02-01T01:26:01.8974160Z http.https://github.com/.extraheader 2024-02-01T01:26:01.8996287Z Entering 'third_party/nccl/nccl' 2024-02-01T01:26:01.9019098Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9042353Z Entering 'third_party/neon2sse' 2024-02-01T01:26:01.9062881Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9084073Z Entering 'third_party/nlohmann' 2024-02-01T01:26:01.9106064Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9132081Z Entering 'third_party/onnx' 2024-02-01T01:26:01.9154223Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9190960Z Entering 'third_party/onnx/third_party/benchmark' 2024-02-01T01:26:01.9212307Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9236810Z Entering 'third_party/onnx/third_party/pybind11' 2024-02-01T01:26:01.9259573Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9285421Z Entering 'third_party/onnx-tensorrt' 2024-02-01T01:26:01.9307514Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9332240Z Entering 'third_party/onnx-tensorrt/third_party/onnx' 2024-02-01T01:26:01.9353557Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9381792Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/benchmark' 2024-02-01T01:26:01.9402263Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9426158Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11' 2024-02-01T01:26:01.9447714Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9469793Z Entering 'third_party/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang' 2024-02-01T01:26:01.9492008Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9520540Z Entering 'third_party/pocketfft' 2024-02-01T01:26:01.9543621Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9566075Z Entering 'third_party/protobuf' 2024-02-01T01:26:01.9587577Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9612822Z Entering 'third_party/protobuf/third_party/benchmark' 2024-02-01T01:26:01.9634306Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9658560Z Entering 'third_party/protobuf/third_party/googletest' 2024-02-01T01:26:01.9678207Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9703626Z Entering 'third_party/psimd' 2024-02-01T01:26:01.9724113Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9746387Z Entering 'third_party/pthreadpool' 2024-02-01T01:26:01.9768664Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9792181Z Entering 'third_party/pybind11' 2024-02-01T01:26:01.9812433Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9835017Z Entering 'third_party/python-peachpy' 2024-02-01T01:26:01.9857677Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9880927Z Entering 'third_party/sleef' 2024-02-01T01:26:01.9902981Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9927018Z Entering 'third_party/tbb' 2024-02-01T01:26:01.9949239Z http.https://github.com/.extraheader 2024-02-01T01:26:01.9973968Z Entering 'third_party/tensorpipe' 2024-02-01T01:26:01.9995502Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0018301Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-02-01T01:26:02.0039482Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0062919Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-02-01T01:26:02.0083004Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0108866Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-02-01T01:26:02.0130112Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0153631Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-02-01T01:26:02.0175137Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0196493Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-02-01T01:26:02.0217579Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0243987Z Entering 'third_party/zstd' 2024-02-01T01:26:02.0266536Z http.https://github.com/.extraheader 2024-02-01T01:26:02.0512809Z Cleaning up orphan processes