oracle/opengrok

wanted: PDF analyzer (Bugzilla #17751)

Open

#492 opened on May 22, 2013

batch import, enhancement, help wanted, indexer

Description

status ACCEPTED, severity enhancement, in component analyzer for ---. Reported in version unspecified on platform ANY/Generic. Assigned to: Lubos Kosco

On 2011-01-20 10:12:05 +0000, Vladimir Kotal wrote:

A PDF analyzer would be beneficial to have, e.g. in order to search design documents together with source code (by selecting a project with the source code and a "project" with the design documents).

On 2011-02-15 13:54:44 +0000, Lubos Kosco wrote:

we could reuse http://pdfbox.apache.org/

after all, the old opengrok (arcs), still used for psarcs, had it like that ...

forwardport? :-D

On 2011-02-15 13:59:43 +0000, Lubos Kosco wrote:

an alternative is to use tika, with pdfbox underneath, and gain a myriad of supported formats for lucene:

http://tika.apache.org/0.8/formats.html

(pdf, (open)office, mbox, rtf, audio/video metadata, also a java class and jar parser; it also has a compressed files parser, which could be used to satisfy bug 343)

I have a feeling this might be one of the major features for next version! :)
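To illustrate why a single detection layer in front of several parsers is attractive, here is a minimal sketch of format sniffing by leading bytes. It is not OpenGrok, PDFBox or Tika code: the class `FormatSniffer`, the method `detect` and the format labels are all hypothetical. The only facts relied on are that PDF files start with `%PDF-` and that ZIP containers (which ODF documents and jars are) start with `PK\3\4`. A real analyzer would hand the detected stream to an actual parser to extract text for Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: guess a document format from its first bytes. */
public class FormatSniffer {
    // Signature -> label. Labels are illustrative only, not OpenGrok or Tika names.
    // The map is only iterated, never looked up by key, so byte[] keys are fine.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "pdf");
        // ZIP local file header; ODF documents and jars are ZIP containers.
        SIGNATURES.put(new byte[] {'P', 'K', 3, 4}, "zip-container");
    }

    /** Returns the label of the first matching signature, or "unknown". */
    public static String detect(byte[] head) {
        for (Map.Entry<byte[], String> e : SIGNATURES.entrySet()) {
            if (startsWith(head, e.getKey())) {
                return e.getValue();
            }
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead)); // prints "pdf"
    }
}
```

With something like this in front, adding a format means registering one more signature and one more parser, which is roughly the convenience a library such as tika would provide out of the box.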

On 2011-03-15 07:29:16 +0000, Lubos Kosco wrote:

for odf formats there is also: http://odftoolkit.org/
