
Refactor usage of environment variables for DAG configuration to use Airflow Variables #4863

sarayourfriend opened this issue Sep 5, 2024 · 1 comment
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow

Comments

sarayourfriend commented Sep 5, 2024

Current Situation

There are a number of configuration settings in our DAG and DAG-support code that are read exclusively using os.getenv and, as a result, are configured as environment variables in the service rather than as metastore Variables, which can be managed and changed far more flexibly.

This prevents us from changing these variables without changing the environment variables, which currently requires restarting the Airflow webserver and scheduler. It also has implications for future remote executor usage: environment variables available to the scheduler are not shared with remote execution environments, so the metastore must be relied on to retrieve variables instead.

To clarify: this is not an issue about using environment variables to configure Airflow Variables; it is an issue about the retrieval of those variables at DAG runtime. How to actually configure Airflow Variables is a separate question and has no bearing on whether DAGs read variables at runtime from the execution environment or from the metastore. Of course, that comes with a big caveat: if we don't use the execution environment's variables to configure them, then they cannot be read from the execution environment. Therefore, reading them from the environment significantly hinders the flexibility we have when choosing how and where to manage variables.

Suggested Improvement

Refactor usage of the following variables such that they can be retrieved using Variable.get. To avoid the performance penalty of calling Variable.get at DAG parse time, most of these will require refactoring DAG code to read variables at execution time (either through templates or Variable.get inside tasks).
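The parse-time vs. execution-time distinction can be sketched as follows. This is an illustrative example, not code from the repository: the Variable class below is a minimal stand-in for airflow.models.Variable so the sketch runs without an Airflow installation, and it mimics Airflow's env-var Variable backend (which resolves Variable.get("key") from an AIRFLOW_VAR_KEY environment variable).

```python
import os


# Minimal stand-in for airflow.models.Variable so this sketch is
# self-contained; it mirrors Airflow's environment-variable Variable
# backend, which resolves Variable.get("key") from AIRFLOW_VAR_KEY.
class Variable:
    @staticmethod
    def get(key, default_var=None):
        return os.environ.get(f"AIRFLOW_VAR_{key.upper()}", default_var)


# Before: read at module import (i.e. DAG parse) time, so the value is
# frozen into whatever environment the scheduler happens to have.
CONTACT_EMAIL = os.getenv("CONTACT_EMAIL", "noreply@example.org")


# After: defer the read to task execution time, where the metastore
# (or whichever Variable backend is configured) is consulted per run.
def notify_maintainers():
    contact_email = Variable.get("contact_email", default_var="noreply@example.org")
    return f"Notifying {contact_email}"
```

The key difference is not the function called but *when* it is called: the deferred read happens inside the task body, so changing the Variable takes effect on the next run without restarting the scheduler.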

This list was built by searching for getenv and os.environ in the DAG code:

  • CONTACT_EMAIL
  • DEFAULT_RETRY_COUNT
  • DATA_REFRESH_POKE_INTERVAL
  • DATA_REFRESH_POOL
  • AWS_CLOUDWATCH_CONN_ID
  • AWS_RDS_CONN_ID
  • ES_INDEX_READINESS_RECORD_COUNT
  • CANONICAL_DOMAIN
  • OUTPUT_DIR
  • DJANGO_ADMIN_URL
  • OPENVERSE_BUCKET

I have ignored places where we read environment variables in unit testing code, since those rely on the local development environment's method of Airflow Variable management being environment variables. We can safely assume we will continue to configure Airflow Variables via environment variables in the local development environment, so unit testing code can continue to read them that way.
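This works because Airflow itself treats environment variables prefixed with AIRFLOW_VAR_ as Variables. A short sketch of that convention, with an illustrative key (not necessarily how the Openverse development environment names it):

```python
import os

# Airflow resolves Variable.get("data_refresh_poke_interval") from an
# environment variable named AIRFLOW_VAR_DATA_REFRESH_POKE_INTERVAL, so
# local development can keep configuring Variables this way even after
# DAG code switches from os.getenv to Variable.get.
os.environ["AIRFLOW_VAR_DATA_REFRESH_POKE_INTERVAL"] = "60"

# Values supplied this way arrive as strings and need casting at the
# point of use; shown here with a plain os.environ read for brevity.
poke_interval = int(os.environ["AIRFLOW_VAR_DATA_REFRESH_POKE_INTERVAL"])
```

So switching DAG code to Variable.get does not force a change in how the development environment (or its tests) supplies the values.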

I've also ignored anything in retired because we can just remove that code altogether, as discussed in #4240. I've assigned the issue to myself and will go ahead and open a PR for it later today (and leave an update here after that's done).

Some of these might be easier to address than others. Many might be best addressed by turning their usage into templates (e.g., CONTACT_EMAIL, DATA_REFRESH_POKE_INTERVAL, and other things passed as arguments to @dag, @task, or invocations of them). Others might be easier to move directly into tasks with Variable.get calls. Still others may require more complex consideration, especially if the current method and usage is relied upon to enable some approach to unit testing.
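For the template route, Airflow renders Jinja in templated fields at task run time, resolving "{{ var.value.<key> }}" from the metastore. A hedged sketch of that resolution, with a tiny regex renderer standing in for Jinja2 so the example is self-contained (the variable name is illustrative):

```python
import re


def render(template: str, variables: dict) -> str:
    """Resolve "{{ var.value.<key> }}" the way Airflow's template
    context would, using a plain dict in place of the metastore."""
    return re.sub(
        r"\{\{\s*var\.value\.(\w+)\s*\}\}",
        lambda m: variables[m.group(1)],
        template,
    )


# e.g. a string passed to an operator's templated field: it stays a
# literal at DAG parse time and is only resolved when the task runs.
email_arg = "{{ var.value.contact_email }}"
rendered = render(email_arg, {"contact_email": "maintainers@example.org"})
```

Because the template is just a string until the task executes, this route avoids any metastore access during DAG parsing entirely.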

This should not be attempted in a single PR! I didn't have time when writing this issue to break it down into what I thought could be accomplished in smaller PRs. When picking up any one of these variables, I recommend doing a search of the codebase and seeing how many and what types of usages of the variable there are. That might help guide whether a given variable can be addressed in the same PR as another.

Benefit

Two main benefits:

  • Ensure the greatest degree of flexibility with respect to how DAG runtime configuration variables are managed in any given environment
  • Remove examples that contradict Airflow best practices and that bypass the tools Airflow provides to enable advanced usage such as remote executors.

Additional context

Related to/came out of this discussion: #4833 (comment)

ccing @WordPress/openverse-catalog for visibility as the main Airflow knowledge holders 🙂 No action item here for y'all, though, other than any helpful input you might have to guide this work... or if you disagree with the premise/goal.

AetherUnbound (Contributor) commented:
Thanks for laying this all out here, I'll respond when I have time with my thoughts!
