[FEATURE] Impl Spark DSv2 YARN Connector that supports reading YARN aggregation logs · apache/kyuubi#6832

Labels: help wanted, kind:feature, priority:major

Description

Code of Conduct

I agree to follow this project's Code of Conduct

Search before asking

I have searched in the issues and found no similar issues.

Describe the feature

Leverage the Spark DSv2 API to implement a connector that provides a SQL interface for accessing YARN aggregated logs, and possibly other YARN resources in the future.

Motivation

In large-scale Spark on YARN deployments, tens or even hundreds of thousands of Spark applications may be submitted to a cluster per day. YARN collects and aggregates the application logs and stores them on HDFS. Sometimes we want to analyze these logs to identify cluster-level issues; for example, a machine with faulty hardware may frequently produce disk or network exceptions. It is straightforward to leverage Spark to analyze those logs in parallel.

Describe the solution

Usage might look like this:

$ spark-sql --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
> SELECT
    app_id, app_attempt_id,
    app_start_time, app_end_time,
    container_id, host,
    file_name, line_num, message
  FROM yarn.agg_logs
  WHERE app_id = 'application_1234'
    AND container_id = 'container_12345'
    AND host = 'hadoop123.example.com'
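To make the motivating analysis concrete, here is a minimal, Spark-free sketch of what could be done once rows with the columns above are available: count the log lines per host whose message mentions an exception. The `LogRow` record and `exceptionLinesPerHost` method are hypothetical names used only for illustration and are not part of any existing Kyuubi API; the sample rows are made up. In practice the same aggregation would more likely be a `GROUP BY host` query against `yarn.agg_logs`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class AggLogAnalysis {
    // Hypothetical in-memory row mirroring a subset of the proposed yarn.agg_logs columns.
    record LogRow(String appId, String containerId, String host,
                  String fileName, long lineNum, String message) {}

    // Counts, per host, the log lines whose message contains the text "Exception".
    // A TreeMap keeps the hosts sorted so the result is stable to print or compare.
    static Map<String, Long> exceptionLinesPerHost(List<LogRow> rows) {
        return rows.stream()
            .filter(r -> r.message().contains("Exception"))
            .collect(Collectors.groupingBy(LogRow::host, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample rows; IDs reuse the placeholder values from the query above.
        List<LogRow> rows = List.of(
            new LogRow("application_1234", "container_12345", "hadoop123.example.com",
                       "stderr", 1, "java.io.IOException: No space left on device"),
            new LogRow("application_1234", "container_12345", "hadoop123.example.com",
                       "stderr", 2, "java.net.SocketTimeoutException: Read timed out"),
            new LogRow("application_1234", "container_12346", "hadoop124.example.com",
                       "stdout", 1, "Job finished"));
        System.out.println(exceptionLinesPerHost(rows));
    }
}
```

A host that stands out in such a count across many applications is a candidate for the hardware problems described under Motivation.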

Additional context

No response

Are you willing to submit PR?

Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
No. I cannot submit a PR at this time.

Contributor guide

Tech stack: Scala, Java, SQL
Domain: data, backend
Issue type: feature
Difficulty: 5
Estimated time: over 1 week
Activity status: fresh
Clarity: mostly clear
Prerequisites: Spark SQL, YARN architecture, Scala/Java, HDFS
Beginner friendliness: 15
Research directions: Review existing Spark DSv2 connectors (e.g., for Kafka, JDBC) to understand the plugin pattern. Study YARN's log aggregation mechanism and how logs are stored on HDFS. Look at Kyuubi's existing connector implementations in the repository. The issue comments may contain additional guidance; check the discussion for requirements. Start by defining the YarnCatalog and implementing reading of YARN aggregation logs using Spark's DataSourceV2 API.
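One small detail that may help when studying how YARN identifies logs: a YARN container ID embeds the application ID and the attempt number, so a filter on `container_id` alone could in principle be used to derive `app_id` and `app_attempt_id`. The sketch below assumes the usual string forms `container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attemptSeq>_<containerSeq>`, `application_<clusterTimestamp>_<appSeq>`, and `appattempt_<clusterTimestamp>_<appSeq>_<6-digit attempt>`. The class and method names are hypothetical, and the sample ID is made up; a real connector would probably rely on Hadoop's own ID classes instead of a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class YarnIds {
    // Assumed layout: container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attemptSeq>_<containerSeq>
    private static final Pattern CONTAINER_ID =
        Pattern.compile("container_(?:e\\d+_)?(\\d+)_(\\d+)_(\\d+)_(\\d+)");

    // Returns the application ID embedded in a container ID.
    static String appIdOf(String containerId) {
        Matcher m = match(containerId);
        return "application_" + m.group(1) + "_" + m.group(2);
    }

    // Returns the application attempt ID embedded in a container ID,
    // re-padding the attempt number to six digits.
    static String appAttemptIdOf(String containerId) {
        Matcher m = match(containerId);
        int attempt = Integer.parseInt(m.group(3));
        return String.format("appattempt_%s_%s_%06d", m.group(1), m.group(2), attempt);
    }

    // Matches the whole string against the assumed layout or fails loudly.
    private static Matcher match(String containerId) {
        Matcher m = CONTAINER_ID.matcher(containerId);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a YARN container ID: " + containerId);
        }
        return m;
    }

    public static void main(String[] args) {
        String containerId = "container_e17_1410901177871_0001_01_000005"; // made-up example
        System.out.println(appIdOf(containerId));
        System.out.println(appAttemptIdOf(containerId));
    }
}
```

Deriving the application ID this way could let a reader of the logs skip directories belonging to other applications, assuming the aggregated logs are organized per application on HDFS.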