
Measuring CO2 impact of GCP Dataflow jobs for training models #141

@unitrium

Description


Prize category

Best Content

Overview

Applying the Impact Framework to measure the carbon emissions resulting from our use of GCP Dataflow pipelines to train models.
The final goal is to write a whitepaper laying out the steps and methodology we followed to measure the environmental impact of training models on a cloud platform like Google Cloud, using currently available resources and documentation.

Since we are working in a virtualized cloud environment, most energy-usage monitoring libraries and tools are not accessible to us. Furthermore, we would rather not integrate third-party plugins into our environment due to privacy and security concerns. Therefore, we will try to establish a proxy method to measure electricity consumption and use carbon intensity data to compute the total carbon cost of training our models on GCP Dataflow.

Questions to be answered

No response

Have you got a project team yet?

Yes and we aren't recruiting

Project team

@wobniarin
@florianscheidl

Terms of Participation


Project Submission

Summary

Everybody talks about the carbon footprint of AI as larger and larger models are trained, yet very few people actually know how large their own carbon footprint is. This has also been the case for us. We base our training efforts on GCP and, while aware of the importance of tracking emissions, we did not know how to get started or how challenging it could be.
Our project provides user guidelines on how to get started with calculating the CO2 emissions of jobs running on GCP Dataflow, without the need to add third-party libraries to the code base.

Problem

Currently, Google Cloud does not provide a granular way of tracking carbon emissions: emissions are aggregated by region and service. We are not able to evaluate the impact of our training pipeline on Google Cloud Dataflow, nor to get metrics for a single job run. Furthermore, it is not possible to integrate third-party libraries that measure energy consumption into code running in a virtualized environment, which limits the measurability of such environments.

Therefore the carbon impact of our model training currently remains a blind spot for us.

Application

Our solution evaluates the carbon footprint of a Google Cloud Platform Dataflow job, in this case our training pipeline. We defined a common methodology so that any Dataflow job can be evaluated using the information Google Cloud already provides: CPU usage, number of workers or instances, and duration of the job. Combining this information with proxy data for the electricity consumption of a cloud instance, we can obtain the total electricity consumption of the training pipeline and, using granular carbon intensity data, translate it into carbon emissions. Our methodology and case study are generic enough that anyone running Dataflow jobs on GCP can apply them and start understanding the carbon footprint of their jobs.
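To make the calculation concrete, here is a minimal sketch in Python of how the figures from a Dataflow job can be turned into an emissions estimate. The per-instance power values and the carbon intensity figure are illustrative placeholders standing in for a Datavizta-style consumption profile and ElectricityMaps data; they are assumptions, not measured values.

```python
# Minimal sketch: turning the figures from a Dataflow job into a CO2 estimate.
# All numeric constants are illustrative placeholders, not measured values.

IDLE_WATTS = 10.0         # assumed idle power of one worker instance (Datavizta-style profile)
FULL_LOAD_WATTS = 50.0    # assumed power of one worker instance at 100% CPU
CARBON_INTENSITY = 300.0  # assumed grid carbon intensity in gCO2eq/kWh (ElectricityMaps-style value)


def instance_power(cpu_utilization: float) -> float:
    """Interpolate linearly between idle and full-load power for a utilization in [0, 1]."""
    return IDLE_WATTS + cpu_utilization * (FULL_LOAD_WATTS - IDLE_WATTS)


def job_emissions(num_workers: int, avg_cpu_utilization: float, duration_hours: float) -> float:
    """Rough emissions estimate for a Dataflow job, in gCO2eq."""
    energy_kwh = num_workers * instance_power(avg_cpu_utilization) * duration_hours / 1000.0
    return energy_kwh * CARBON_INTENSITY


# Example: 8 workers at roughly 60% CPU for 2 hours.
print(f"{job_emissions(8, 0.6, 2.0):.1f} gCO2eq")
```

In practice, the worker count and CPU utilization would be read from the job's monitoring data rather than supplied as single aggregates, as described in the Process section below.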

Prize category

Best content

Judging Criteria

Our solution is a first step towards raising awareness of the environmental impact of cloud jobs run on GCP, and it establishes a methodology that anyone can understand and apply by looking at the data available after a cloud job has run. It can be difficult for non-IT people who use cloud jobs to start evaluating their environmental impact; with our submission we want to make this as transparent and approachable as possible.

There are already libraries that can be added to a code base and websites that give an approximate carbon footprint for a cloud job. However, none of them are very clear about the methodology being used. Our solution defines a methodology that can be understood by technical and non-technical people alike, and we applied it ourselves to make sure it can actually be implemented.

Video

Our video submission

Artefacts

CarbonHack 2024 - How to measure the carbon footprint of your DataFlow jobs.pdf

Process

We combined several data sources to estimate first the consumption of our computations and then the associated emissions. Our first source, Dataflow, gives us minute-by-minute information about the number of instances used and their utilization percentage. We then used Datavizta from Boavizta to estimate the consumption profile of a single instance, before finally combining this with carbon intensity data from ElectricityMaps to measure the final impact of a Dataflow job in terms of carbon emissions.
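The sketch below shows, again with placeholder numbers, how these three sources can be combined minute by minute: the per-minute instance counts and utilization stand in for the Dataflow metrics, the idle/full-load wattage pair stands in for the Datavizta profile, and the small hourly table stands in for ElectricityMaps carbon intensity data.

```python
from datetime import datetime, timedelta

# Stand-in per-minute Dataflow metrics: (timestamp, instance count, average CPU utilization in [0, 1]).
# In our approach these come from the job's minute-by-minute monitoring data.
start = datetime(2024, 3, 1, 10, 0)
samples = [(start + timedelta(minutes=i), 4 if i < 30 else 8, 0.5 + 0.004 * i) for i in range(60)]

# Stand-in consumption profile for one instance (idle / full-load watts, Datavizta-style).
IDLE_WATTS, FULL_LOAD_WATTS = 10.0, 50.0

# Stand-in hourly grid carbon intensity in gCO2eq/kWh (ElectricityMaps provides this per zone and hour).
carbon_intensity = {
    datetime(2024, 3, 1, 10, 0): 280.0,
    datetime(2024, 3, 1, 11, 0): 310.0,
}

total_emissions_g = 0.0
for timestamp, instances, utilization in samples:
    power_w = instances * (IDLE_WATTS + utilization * (FULL_LOAD_WATTS - IDLE_WATTS))
    energy_kwh = power_w / 1000.0 / 60.0  # energy used during this one-minute sample
    hour = timestamp.replace(minute=0, second=0, microsecond=0)
    total_emissions_g += energy_kwh * carbon_intensity[hour]

print(f"Estimated job footprint: {total_emissions_g:.1f} gCO2eq")
```

Integrating at the minute level is what lets the estimate follow Dataflow's autoscaling as well as the hourly variation of the grid's carbon intensity.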

Inspiration

As part of our jobs, we collect electricity data worldwide and transform it into carbon emissions data, electricity insights and forecasts, so that end users can take action and lower their carbon footprint. We do this because we believe that information precedes action, and that by making this data available we can push decarbonization forward.
However, when using software we ran into the same questions over and over again, and they inspired us to join CarbonHack: What are the CO2 emissions we are responsible for by running our software business? Are they negligible? Shouldn't we lead by example and start assessing these CO2 emissions?

Challenges

It’s nearly impossible to directly measure the electricity consumption of a virtual server in the cloud, and there is limited information available on the consumption of Google Cloud instances.
Some websites provide static proxies to estimate the carbon footprint of cloud computing, but we wanted something more precise that takes the dynamic nature of Dataflow into account.

Accomplishments

We are very happy to be able to take a first step towards making the invisible impacts of cloud jobs visible. With our submission we managed to understand how difficult it is to evaluate the carbon footprint of cloud jobs but at the same time we have been able to provide a first approximation of it. With these values it is possible to raise awareness about the carbon footprint we are responsible for and start lowering it whenever possible.

Learnings

  1. Measuring electricity consumption in a virtualized cloud environment is much harder than we expected. If you do not operate the hardware on which your computations run, it is very hard to get this information, and you have to rely on a lot of proxies.
  2. It’s fascinating to dive into how the electricity behind the compute resources we use daily in our cloud environments is generated; cloud providers and their users have an immense responsibility to minimize their impact on the planet.
  3. There are immediate, simple steps we can take to reduce our carbon footprint.

What's next?

Our methodology can be integrated directly with the Impact Framework to measure any impact related to the use of Google Cloud Dataflow. With some adjustments it could also be used to assess the impact of other Google Cloud services, thereby extending the coverage of GCP in the Impact Framework. To improve accuracy, we should work on adding GCP consumption profiles to Datavizta, which should be possible using their methodology and the publicly available information on GCP compute resources.

In the long term, we hope that this methodology contributes to a better understanding of the impacts of large-scale computation services.
