GreenScheduler: Coordinated Two-Tier Energy Optimization for Disaggregated LLM Serving

Authors

  • Waled Milad Abulgasem Alashheb, The Higher Institute of Science and Technology, Tripoli
  • Mabruka Khlifa Ali Karkeb, The Higher Institute of Science and Technology, Souq Aljuma
  • Sabria AbdulGader Ali Elmusrati, The Higher Institute of Science and Technology, Tripoli
  • Sumia Abdussalam Milad Elagtel, The Higher Institute of Science and Technology, Souq Aljuma

DOI:

https://doi.org/10.65405/cn221159

Abstract

Large Language Model (LLM) inference has become a dominant consumer of energy in modern AI data centers, often accounting for over 90% of total operational power [1]. Recent architectural shifts toward prefill/decode disaggregation have improved performance but created complex energy optimization challenges. This paper introduces GreenScheduler, a novel two-tier framework designed to jointly optimize GPU placement and Dynamic Voltage and Frequency Scaling (DVFS) in disaggregated environments. Tier 1 performs coarse-grained (minute-scale) phase-aware provisioning using predictive workload modeling, while Tier 2 executes fine-grained (millisecond-scale) frequency control. For the compute-bound prefill stage, GreenScheduler employs Model Predictive Control (MPC) to manage queue dynamics; for the memory-bound decode stage, it utilizes a lightweight slack-aware adaptation mechanism. Evaluations using production Azure traces [2] on an H100 cluster demonstrate that GreenScheduler achieves significant energy reduction in both the decode and prefill pools compared to performance-optimized baselines like DistServe [3], while strictly maintaining Time to First Token (TTFT) and Time Per Output Token (TPOT) Service-Level Objectives (SLOs).
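To make the division of labor between the two Tier 2 controllers concrete, the sketch below illustrates one way they could be structured. It is a minimal reconstruction based only on the abstract, not the authors' implementation: the candidate frequency grid, the fluid queue model inside the MPC rollout, the per-MHz service-rate constant, and the 10% slack margin are all hypothetical placeholders.

"""Minimal sketch of GreenScheduler-style Tier 2 frequency control.

Illustrative only; class names, constants, and models are assumptions
made for exposition, not the paper's implementation.
"""

from dataclasses import dataclass

# Candidate GPU SM frequencies in MHz (hypothetical grid; the H100
# exposes a much finer set of clock steps).
FREQ_STEPS = [1980, 1700, 1400, 1100, 800]


@dataclass
class PrefillState:
    queue_len: int        # requests waiting in the prefill queue
    arrival_rate: float   # requests/s (predicted by Tier 1)
    ttft_slo: float       # seconds


def mpc_prefill_freq(state: PrefillState, horizon: int = 5, dt: float = 0.05) -> int:
    """Prefill pool: pick the lowest frequency whose predicted queue
    evolution over the horizon keeps estimated TTFT within the SLO.

    Uses a toy fluid queue model in which service rate scales with
    frequency, the premise the abstract gives for applying MPC to the
    compute-bound prefill stage.
    """
    for freq in sorted(FREQ_STEPS):                # try low-power options first
        service_rate = 0.01 * freq                 # hypothetical req/s per MHz
        q = float(state.queue_len)
        feasible = True
        for _ in range(horizon):                   # roll the queue model forward
            q = max(0.0, q + (state.arrival_rate - service_rate) * dt)
            est_ttft = (q + 1.0) / service_rate    # waiting time + own service
            if est_ttft > state.ttft_slo:
                feasible = False
                break
        if feasible:
            return freq
    return max(FREQ_STEPS)                         # SLO at risk: run flat out


def slack_aware_decode_freq(current_freq: int, measured_tpot: float,
                            tpot_slo: float, margin: float = 0.1) -> int:
    """Decode pool: lightweight slack-aware adaptation.

    Decode is memory-bound, so per-token latency is only weakly
    sensitive to frequency; step the clock down while measured TPOT
    leaves slack and back up once the margin is exhausted. The 10%
    margin is an assumed tuning knob.
    """
    idx = FREQ_STEPS.index(current_freq)
    slack = tpot_slo - measured_tpot
    if slack > margin * tpot_slo and idx < len(FREQ_STEPS) - 1:
        return FREQ_STEPS[idx + 1]                 # slack available: go lower
    if slack < 0 and idx > 0:
        return FREQ_STEPS[idx - 1]                 # SLO violated: go higher
    return current_freq


if __name__ == "__main__":
    state = PrefillState(queue_len=8, arrival_rate=20.0, ttft_slo=0.5)
    print("prefill freq:", mpc_prefill_freq(state))
    print("decode freq:", slack_aware_decode_freq(1400, 0.035, 0.05))

In this reading, the MPC loop earns its cost on the compute-bound prefill side, where frequency directly sets the service rate and queue dynamics matter, while the decode side needs only a cheap feedback rule because memory-bound token generation tolerates lower clocks.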


References

[1] Chenxu Niu, Hao Zhang, et al. TokenPowerBench: Benchmarking the power consumption of LLM inference, 2025.

[2] Microsoft Azure. Azure LLM inference trace dataset 2024, 2024. Accessed: 2026-04-07.

[3] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024.

[4] OpenAI. GPT-4 technical report, 2023.

[5] Anthropic. Claude 2 model card, 2023. Accessed: 2026-04-07.

[6] Woosuk Kwon et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 23), 2023.

[7] M. Khan et al. Measuring the energy footprint of LLM inference, 2025.

[8] Y. Chung et al. Where do the joules go? diagnosing inference energy consumption, 2026.

[9] H. Zhang et al. P/D-Serve: Serving disaggregated large language model at scale, 2024.

[10] Y. Zhang et al. ShuffleInfer: Disaggregate LLM inference for mixed downstream workloads. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2024.

[11] Z. Li et al. WindServe: Efficient phase-disaggregated LLM serving with stream-based dynamic scheduling. Proceedings of the ACM, 2024.

[12] Y. Xu et al. KVDirect: Distributed disaggregated LLM inference, 2025.

[13] Z. Wang et al. DynaServe: Unified and elastic execution for dynamic disaggregated LLM serving, 2025.

[14] Pratyush Patel, Esha Choukse, et al. Splitwise: Efficient generative LLM inference using phase splitting, 2024.

[15] NVIDIA. TensorRT-LLM: A TensorRT toolbox for optimized large language model inference, 2024. Accessed: 2026-04-07.

[16] Y. Zhang et al. vAttention: Dynamic memory management for serving LLMs without PagedAttention, 2024.

[17] A. Agrawal, N. Kedia, et al. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024.

[18] Y. Jiang et al. Decoupling GPGPU voltage-frequency scaling for deep-learning applications. Journal of Parallel and Distributed Computing, 2022.

[19] D. Soudris et al. Energy efficient GPU frequency scaling policy for inference serving using queue model. In IEEE Conference Proceedings, 2024.

[20] Kuan-Hsun Chen et al. Reducing compute waste in LLMs through kernel-level DVFS, 2026.

[21] Q. Liu et al. GreenLLM: SLO-aware dynamic frequency scaling for energy-efficient LLM serving, 2025.

[22] A. K. Kakolyris et al. throttLL'eM: Predictive GPU throttling for energy-efficient LLM inference serving, 2024.

[23] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024.

[24] James B. Rawlings, David Q. Mayne, and Moritz Diehl. Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, 2nd edition, 2020.

[25] Y. Zhang et al. A survey on inference engines for large language models: Perspectives on optimization and efficiency, 2024.

[26] T. Hedgebeth et al. Part-time power measurements: nvidia-smi's lack of attention, 2023.

[27] NVIDIA. NVIDIA H100 Tensor Core GPU architecture (Hopper) whitepaper, 2022. Accessed: 2026-04-07.

[28] Qunyou Liu, Darong Huang, Marina Zapater, and David Atienza. GreenLLM: SLO-aware dynamic frequency scaling for energy-efficient LLM serving. arXiv preprint arXiv:2508.16449, 2025.

[29] Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, and Dimitrios Soudris. SLO-aware GPU frequency scaling for energy-efficient LLM inference serving. arXiv preprint arXiv:2408.05235, 2024.

[30] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024.

[31] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, 2024.


Published

2026-03-01

How to Cite

GreenScheduler: Coordinated Two-Tier Energy Optimization for Disaggregated LLM Serving. (2026). Comprehensive Journal of Science, 10(39), 3804-3811. https://doi.org/10.65405/cn221159
