{"id":192085,"date":"2025-10-05T22:53:28","date_gmt":"2025-10-05T22:53:28","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/192085\/"},"modified":"2025-10-05T22:53:28","modified_gmt":"2025-10-05T22:53:28","slug":"introducing-managed-accounting-for-aws-parallel-computing-service","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/192085\/","title":{"rendered":"Introducing managed accounting for AWS Parallel Computing Service"},"content":{"rendered":"\n<p>This post was contributed by Ramin Torabi and Tarun Mathur from AWS, and Nick Ihli from SchedMD.\u00a0<\/p>\n<p>AWS Parallel Computing Service (PCS) is a managed service that makes it easier for you to run and scale your high performance computing (HPC) workloads on AWS using Slurm. Organizations running HPC clusters want to monitor resource utilization, enforce resource limits, and manage access control to specific capacity across users and projects. They want to understand \u201cwho did what\u201d in their cluster for leadership reporting, capacity planning, and budgeting purposes. PCS now supports accounting, a Slurm feature that enables these activities in a cluster. 
PCS manages the accounting database for the cluster, so you don\u2019t have to set up and manage a separate accounting database.<\/p>\n<p>In this post, we\u2019ll show you how this works, and point you to some actual use cases you can try yourself.<\/p>\n<p>       Setup <\/p>\n<p>Follow these steps to enable accounting in PCS:<\/p>\n<p>        <a href=\"https:\/\/docs.aws.amazon.com\/pcs\/latest\/userguide\/getting-started_create-cluster.html\" rel=\"nofollow noopener\" target=\"_blank\">Create a new PCS cluster<\/a> with Slurm 24.11 or later, opt in to the accounting feature, and configure <a href=\"https:\/\/docs.aws.amazon.com\/pcs\/latest\/userguide\/slurm-accounting.html\" rel=\"nofollow noopener\" target=\"_blank\">optional accounting parameters<\/a> as shown in Figure 1.<br \/>\n        Once the cluster status is Active, verify that accounting is enabled and review the configured parameters on the cluster details console page.<br \/>\n        Configure and connect to a login node <a href=\"https:\/\/docs.aws.amazon.com\/pcs\/latest\/userguide\/working-with_login-nodes.html\" rel=\"nofollow noopener\" target=\"_blank\">as described here<\/a> and run accounting commands using the root account. <\/p>\n<p>        <img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5044\" class=\"size-full wp-image-5044\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-1.png\" alt=\"Figure 1 \u2013 The \u201cCreate Cluster\u201d console experience where you can enable and configure accounting.\" width=\"748\" height=\"803\"\/> <\/p>\n<p id=\"caption-attachment-5044\" class=\"wp-caption-text\">Figure 1 \u2013 The \u201cCreate Cluster\u201d console experience where you can enable and configure accounting.<\/p>\n<p>       Use Case 1: Attribute Usage to Projects <\/p>\n<p>Organizations want visibility into resource usage for each project or department so they can charge usage back to the relevant cost centers. 
To do so, they need to track and attribute usage at different levels. One scenario, shown in Figure 2:<\/p>\n<p>        Create 3 users, create accounts proj_physics and proj_chemistry, and add users to accounts with the sacctmgr command. An \u2018account\u2019 is an organizational unit used to group and manage users.<br \/>\n        Each user submits a single job attributed to the proj_physics account with the --account= flag.<br \/>\n        Validate that those jobs are attributed to the correct account proj_physics by looking up accounting data with the sacct command.<br \/>\n        Validate that user1 is a member of both proj_physics and proj_chemistry accounts. <\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5062\" loading=\"lazy\" class=\"size-full wp-image-5062\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-2-v2.png\" alt=\"Figure 2 \u2013 Example workflow of project attribution.\" width=\"1024\" height=\"1924\"\/> <\/p>\n<p id=\"caption-attachment-5062\" class=\"wp-caption-text\">Figure 2 \u2013 Example workflow of project attribution.<\/p>\n<p>       Use Case 2: Enforce Limits <\/p>\n<p>Organizations want to set constraints on particular users or projects so one party isn\u2019t hoarding resources. One scenario:<\/p>\n<p>        Set a limit of 6000 CPU minutes (100 CPU hours) for a particular user with the command sacctmgr modify user username set GrpTRESRunMins=cpu=6000<br \/>\n        Run the validation in Figure 3 to check that the limit was set correctly.<br \/>\n        Let\u2019s assume the user has already used 95 CPU hours, and then tries to submit a job that would exceed their quota: sbatch --cpus-per-task=10 --time=1:00:00 myjob.sh. 
This job requests 10 CPUs for 1 hour, which would use 10 CPU hours, exceeding the remaining 5 CPU hours in the user\u2019s quota.<br \/>\n        The job submission will fail and the user will see the error in Figure 4.<br \/>\n        The user can then submit a smaller job of, say, 4 CPU hours, which would be accepted because it fits within the remaining quota: sbatch --cpus-per-task=2 --time=2:00:00 smalljob.sh <\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5049\" loading=\"lazy\" class=\"size-full wp-image-5049\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-3.png\" alt=\"Figure 3 \u2013 Example check to identify a limit was set correctly.\" width=\"465\" height=\"68\"\/> <\/p>\n<p id=\"caption-attachment-5049\" class=\"wp-caption-text\">Figure 3 \u2013 Example check to identify a limit was set correctly.<\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5047\" loading=\"lazy\" class=\"size-full wp-image-5047\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-4.png\" alt=\"Figure 4 \u2013 Example output of an error due to a job submission that exceeds user limit.\" width=\"865\" height=\"34\"\/> <\/p>\n<p id=\"caption-attachment-5047\" class=\"wp-caption-text\">Figure 4 \u2013 Example output of an error due to a job submission that exceeds user limit.<\/p>\n<p>       Use Case 3: Generate Usage Reports <\/p>\n<p>Organizations want usage report summaries to assess their resource utilization and plan future capacity allocations. 
One scenario:<\/p>\n<p>        Query all jobs run in your cluster in the past week with the command<br \/>sacct --starttime=$(date -d &quot;7 days ago&quot; +%Y-%m-%d) --format=&quot;JobID,User,JobName,Partition,Account,AllocCPUS,State,ExitCode&quot;<br \/>\n        The example output in Figure 5 shows the unique JobID of each job submission, which user submitted it, which partition (queue) the job was submitted to, how many CPUs were allocated to that job, and the status of that job. Analyze that data to identify broader trends in your cluster. Note that most submitted jobs completed, yet job 1236 failed, job 1238 was cancelled, and job 1240 is a large running job with 16 allocated CPUs.<br \/>\n        Query utilization over a single month with the command<br \/>sreport cluster AccountUtilizationByUser start=2025-04-01 end=2025-04-30 -t percent format=&quot;Accounts,Login,Proper,Used&quot;<br \/>\n        The example output in Figure 6 shows cluster utilization by project and user. Analyze these trends to identify adjustments to your allocation strategy across users and accounts. Consider whether it is fair that project_a01, and user1 in particular, are hoarding 42% of resources over the month.<br \/>\n        Query top users over the prior month with the command sreport user topusage start=2025-03-01 end=2025-03-31<br \/>\n        The example output in Figure 7 lists the top users by CPU minutes in the cluster. Note that user1 was also the largest user in the prior month. 
<\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5045\" loading=\"lazy\" class=\"size-full wp-image-5045\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-5.png\" alt=\"Figure 5 \u2013 Weekly Jobs report using sacct command.\" width=\"582\" height=\"219\"\/> <\/p>\n<p id=\"caption-attachment-5045\" class=\"wp-caption-text\">Figure 5 \u2013 Weekly Jobs report using sacct command.<\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5050\" loading=\"lazy\" class=\"size-full wp-image-5050\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-6.png\" alt=\"Figure 6 \u2013 Monthly cluster utilization report using sreport command.\" width=\"550\" height=\"299\"\/> <\/p>\n<p id=\"caption-attachment-5050\" class=\"wp-caption-text\">Figure 6 \u2013 Monthly cluster utilization report using sreport command.<\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5043\" loading=\"lazy\" class=\"size-full wp-image-5043\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-7.png\" alt=\"Figure 7 \u2013 Monthly top users report using sreport command.\" width=\"545\" height=\"230\"\/> <\/p>\n<p id=\"caption-attachment-5043\" class=\"wp-caption-text\">Figure 7 \u2013 Monthly top users report using sreport command.<\/p>\n<p>       Use Case 4: Identify Job Issues <\/p>\n<p>Individual users want usage report summaries to identify and remediate job failures. 
One scenario:<\/p>\n<p>        The user checks their failed jobs in the past week with the command<br \/>sacct -u username --starttime=$(date -d &quot;7 days ago&quot; +%Y-%m-%d) --format=&quot;JobID,JobName,State,ExitCode,Start,End,MaxRSS,MaxVMSize,Comment&quot;<br \/>\n        The example output in Figure 8 helps the user identify that two jobs failed due to high memory usage, and that the third job succeeded because it was more memory efficient (2800MB used).<br \/>\n        The user reviews the job scripts for both cnn_test and bert_run and identifies the root cause: the scripts are not requesting enough memory. The user can then decide whether to modify the scripts to request sufficient memory and re-submit those two jobs. <\/p>\n<p>        <img decoding=\"async\" aria-describedby=\"caption-attachment-5046\" loading=\"lazy\" class=\"size-full wp-image-5046\" src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2025\/10\/HPCBlog-365-fig-8.png\" alt=\"Figure 8 \u2013 Weekly job failures report for a particular user using sacct command.\" width=\"815\" height=\"88\"\/> <\/p>\n<p id=\"caption-attachment-5046\" class=\"wp-caption-text\">Figure 8 \u2013 Weekly job failures report for a particular user using sacct command.<\/p>\n<p>For more accounting use cases see SchedMD documentation for the <a href=\"https:\/\/slurm.schedmd.com\/sacctmgr.html\" rel=\"nofollow noopener\" target=\"_blank\">sacctmgr<\/a>, <a href=\"https:\/\/slurm.schedmd.com\/sacct.html\" rel=\"nofollow noopener\" target=\"_blank\">sacct<\/a>, and <a href=\"https:\/\/slurm.schedmd.com\/sreport.html\" rel=\"nofollow noopener\" target=\"_blank\">sreport<\/a> commands.<\/p>\n<p>       Pricing <\/p>\n<p>Enabling accounting will incur two additional charges: an hourly accounting usage fee that varies by the cluster controller size you choose, and an accounting storage fee that is billed per GB-month. 
The accounting usage fee is billed while accounting is enabled on your cluster, and the accounting storage fee scales based on the number of accounting records stored (configurable via the Default Purge Time parameter explained in the <a href=\"https:\/\/docs.aws.amazon.com\/pcs\/latest\/userguide\/slurm-accounting.html\" rel=\"nofollow noopener\" target=\"_blank\">documentation<\/a>). More details are on the <a href=\"https:\/\/aws.amazon.com\/pcs\/pricing\/\" rel=\"nofollow noopener\" target=\"_blank\">PCS pricing page<\/a>.<\/p>\n<p>       Conclusion <\/p>\n<p>In this post, we showed how you can use the managed accounting feature in AWS Parallel Computing Service to monitor usage on your cluster. Get started today by visiting the <a href=\"https:\/\/console.aws.amazon.com\/pcs\/home\" rel=\"nofollow noopener\" target=\"_blank\">AWS PCS management console<\/a>. Let us know what you think!<\/p>\n","protected":false},"excerpt":{"rendered":"This post was contributed by Ramin Torabi and Tarun Mathur from AWS, and Nick Ihli from SchedMD.\u00a0 
AWS&hellip;\n","protected":false},"author":2,"featured_media":192086,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[49,48,285,61],"class_list":{"0":"post-192085","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-computing","8":"tag-ca","9":"tag-canada","10":"tag-computing","11":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/192085","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=192085"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/192085\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/192086"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=192085"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=192085"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/tags?post=192085"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}