wanted: PDF analyzer (Bugzilla #17751) · oracle/opengrok#492

12 comments · 0 reactions · 1 assignee · Java · 747 forks · batch import

enhancement · help wanted · indexer

Repository metrics

Stars: 4,060
PR merge metrics: average merge time 2d 17h, 11 PRs merged in 30 days

Description

status ACCEPTED severity enhancement in component analyzer for ---
Reported in version unspecified on platform ANY/Generic
Assigned to: Lubos Kosco

On 2011-01-20 10:12:05 +0000, Vladimir Kotal wrote:

PDF analyzer would be beneficial to have, e.g. in order to search design documents together with source code (by selecting a project with the source code and a "project" with design documents).

On 2011-02-15 13:54:44 +0000, Lubos Kosco wrote:

we could reuse http://pdfbox.apache.org/

after all old opengrok - arcs - still used for psarcs had it like that ...

forwardport? :-D

On 2011-02-15 13:59:43 +0000, Lubos Kosco wrote:

an alternative is to use tika, which has pdfbox underneath, and gain a myriad of supported formats for lucene:

http://tika.apache.org/0.8/formats.html

(pdf, (open)office, mbox, rtf, audio/video metadata, alt. java class and jar parser; it also has a compressed files parser, which can be used to satisfy bug 343)

I have a feeling this might be one of the major features for next version! :)

On 2011-03-15 07:29:16 +0000, Lubos Kosco wrote:

for odf formats we also have: http://odftoolkit.org/

Contributor guide

Research direction: Investigate integrating a PDF parser (e.g., Apache Tika or PDFBox) into OpenGrok's analyzer framework to support PDF file indexing and search.
Tech stack: java
Domain: backend
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Stale
Clarity: Clear
Prerequisites: Java, Git
Newbie friendliness: 50
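
As a starting point for the research direction above, here is a minimal sketch of how a PDF analyzer might be shaped: recognize a PDF by its leading bytes, then hand the content to a pluggable text extractor whose output would go to the indexer. It uses only the JDK. The names `PdfAnalyzerSketch`, `TextExtractor`, `looksLikePdf` and `textForIndex` are hypothetical and are not part of OpenGrok's analyzer framework. The extractor is a stub: a real one would presumably delegate to PDFBox or Tika, as suggested in the comments.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;

/** Illustrative only: none of these names exist in OpenGrok. */
public class PdfAnalyzerSketch {

    /** A PDF file conventionally begins with the ASCII bytes "%PDF-". */
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Hypothetical extension point. A real implementation would wrap a
     * PDF library (for example PDFBox or Tika) and return the plain text.
     */
    public interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** Returns true when the given bytes start with the PDF header. */
    public static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /**
     * Returns the text that would be handed to the indexer, or an empty
     * Optional when the content is not a PDF and should be left to
     * another analyzer.
     */
    public static Optional<String> textForIndex(byte[] content,
                                                TextExtractor pdfExtractor) {
        if (!looksLikePdf(content)) {
            return Optional.empty();
        }
        try (InputStream in = new ByteArrayInputStream(content)) {
            return Optional.ofNullable(pdfExtractor.extract(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stub extractor standing in for a real PDF library.
        TextExtractor stub = in -> "extracted text placeholder";

        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] source = "package demo;".getBytes(StandardCharsets.US_ASCII);

        System.out.println(textForIndex(pdf, stub));
        System.out.println(textForIndex(source, stub));
    }
}
```

Keeping the extractor behind an interface leaves the PDFBox versus Tika choice open, and the header check can be tested without any PDF library on the classpath.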
