apache/seatunnel
Support parse pdf to structured data (Parser + Normalization).
Open
#9716 opened on Aug 18, 2025
2 comments · 0 reactions · 1 assignee · Java · 6,897 stars · 1,432 forks
batch import
help wanted
Description
This issue does not include a description.
Contributor guide
- Tech stack
- java
- Domain
- backend, data
- Issue type
- feature
- Difficulty (estimated implementation difficulty for a new contributor, from 1 for very small changes to 5 for expert-level work)
- 3
- Estimated time (a rough time range for an experienced contributor to investigate, implement, test, and prepare a pull request)
- 3-5 days
- Activity status (how available the issue appears right now: fresh, active, stale, blocked, or waiting on maintainer input)
- blocked
- Clarity (how clearly the issue explains the expected change, acceptance criteria, and next step)
- needs investigation
- Prerequisites
- Java knowledge; understanding of PDF parsing; familiarity with SeaTunnel architecture; data transformation basics
- Newbie friendliness (a 1-100 score estimating how approachable this issue is for first-time contributors)
- 45
- Research direction
- To implement this feature, start by examining the existing SeaTunnel source connectors (for example, under the `seatunnel-connectors-v2` module) to understand how they read and normalize data. Consider Apache PDFBox for PDF parsing, and map the extracted content to a structured format such as JSON or Avro. Review the issue comments and any linked discussions for additional context, and coordinate with the assignee to avoid duplicate work.
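Since the issue has no description, the following is only a rough sketch of what the "Normalization" half of the research direction might look like, not the design the maintainers have in mind. It assumes the raw text of each page has already been extracted (for instance with PDFBox's `PDFTextStripper`), and it uses only the Java standard library. The class name `PdfTextNormalizer` and the field names `page`, `line`, and `text` are hypothetical; they are not part of SeaTunnel or PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a SeaTunnel class): turns the raw text of a PDF page
 * into flat, JSON-friendly records. Text extraction itself is assumed to happen
 * upstream, e.g. via PDFBox's PDFTextStripper.
 */
final class PdfTextNormalizer {

    private PdfTextNormalizer() {
    }

    /**
     * Produces one record per non-blank line of the page.
     * The field names (page, line, text) are illustrative assumptions.
     */
    static List<Map<String, Object>> normalizePage(int pageNumber, String rawText) {
        List<Map<String, Object>> records = new ArrayList<>();
        int lineNumber = 0;
        // \R matches any line break sequence (\n, \r\n, and so on).
        for (String rawLine : rawText.split("\\R")) {
            // Collapse the irregular whitespace that PDF text extraction often leaves behind.
            String text = rawLine.replaceAll("\\s+", " ").trim();
            if (text.isEmpty()) {
                continue; // blank lines carry no data
            }
            lineNumber++;
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("page", pageNumber);
            record.put("line", lineNumber);
            record.put("text", text);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String page = "Invoice   No. 42\n\n  Total:\t100.00 EUR \n";
        for (Map<String, Object> record : normalizePage(1, page)) {
            System.out.println(record);
        }
    }
}
```

Running `main` prints `{page=1, line=1, text=Invoice No. 42}` followed by `{page=1, line=2, text=Total: 100.00 EUR}`. A real connector would presumably map such records onto SeaTunnel's row type and schema rather than plain maps; that mapping is left out here because the issue does not specify it.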