A short summary of the search engines Nutch and Red Piranha. The analysis was carried out with regard to the system to be developed (Lucene eLecture).
Nutch 0.7-dev
- Front end is laid out for deployment at the root => wrong links otherwise
- No support for PPT and PDF (currently: HTML, MP3, MS Word, RTF, plain text)
- PlugIn concept
- The necessary additional data (page numbers, page titles, etc.) could be stored by modifying the PlugIn
- A PDF PlugIn should be available for Nutch shortly
- A PPT PlugIn (and possibly a PDF PlugIn) would have to be provided
- Later also a Flash and a Lecturnity PlugIn
- Enabling directory listings on the file server would allow a complete scan; otherwise a file with URLs must be provided
- Local crawling (directly via a hard-disk path) is apparently not possible, since links to the resources are needed
- Cache functionality (MD5 hashing): records whether URLs or documents have changed since the index was built
- Successfully tested on very large data sets
- Front end: shows the text context of the query terms alongside the results
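The MD5-based cache idea noted above can be sketched as follows. This is only a minimal illustration in Python, not Nutch's actual implementation; the function names `md5_of` and `changed_since_index` and the in-memory dictionary of stored digests are assumptions made for the example.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Return the hexadecimal MD5 digest of a document's raw bytes."""
    return hashlib.md5(content).hexdigest()


def changed_since_index(url: str, content: bytes, stored: dict) -> bool:
    """Report whether a document differs from the digest recorded at index time.

    `stored` maps URL -> digest (a hypothetical store; a real crawler would
    persist it). Unknown URLs count as changed. The store is updated so that
    the next call compares against the current content.
    """
    digest = md5_of(content)
    changed = stored.get(url) != digest
    stored[url] = digest
    return changed
```

Only documents for which `changed_since_index` returns `True` would need to be parsed and indexed again; unchanged documents could be skipped.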
Red Piranha 0.3
- Supports PDF (also plain text, XML, HTML)
- PPT is apparently parsed as binary, like all other unknown formats
- Result rating and/or training system ("I like this", "not for me")
- Via "ADD information", data can be added to the index but apparently not removed
- Crawling cannot be aborted
- Crawling the Info2 slides alone (approx. 21.1 MB) takes 45 minutes
- The index is about 7% of the size of the indexed data => for 360 GB, the index size would be 25.2 GB
- The index can be accessed while it is being built
- Seems to keep the index up to date (background process) --> is managing large data sets feasible?
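The figures above can be turned into a rough projection for the full 360 GB. The sketch below reproduces the index-size estimate and adds a crawl-time extrapolation. The linear scaling of crawl time and the conversion 1 GB = 1024 MB are assumptions of this example, not measurements, and the function names are hypothetical.

```python
def index_size_gb(data_gb: float, ratio: float = 0.07) -> float:
    """Estimated index size, using the roughly 7% ratio observed in the notes."""
    return data_gb * ratio


def crawl_days(data_gb: float, sample_mb: float = 21.1,
               sample_minutes: float = 45.0) -> float:
    """Linear extrapolation of crawl time from the Info2 sample.

    Assumes 1 GB = 1024 MB and that crawl time grows in proportion to the
    data volume; both are simplifications, not measured facts.
    """
    minutes = (data_gb * 1024) / sample_mb * sample_minutes
    return minutes / (60 * 24)


print(round(index_size_gb(360), 1))  # 25.2
print(round(crawl_days(360)))        # crawl time in days, if scaling were linear
```

If crawl time really scaled linearly, a full crawl at the observed rate would take well over a year, which is why the crawl speed matters at least as much as the index size for the question of large data sets.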
Both systems (disadvantages)
- Are probably designed for crawling websites rather than a file server
- Changes to the program code are essential to keep the metadata required for the eLecture portal
- Path prefixes must be cut off in the PlugIns (when forming the URI)
- With Nutch, all document types would currently have to be implemented; with Red Piranha, Flash and Lecturnity
- Corresponding changes to the front end
- Custom (Lucene-based) data formats would be needed for the corresponding auxiliary features
- Documentation is sparse
Both systems (advantages)
- The Nutch community would welcome additionally developed PlugIns
- Extending to other document types (already supported by the systems) is possibly not very complex
- When using RP: an additional, possibly useful feature (training system) is already integrated
- When using Nutch: Cache feature
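
The path-prefix trimming listed among the disadvantages, and the URL file Nutch needs when directory listings are not enabled, both come down to mapping a file-server path to a URL. A minimal sketch of that mapping in Python follows; the function name `path_to_url`, the directory `/srv/lectures`, and the host `fileserver.example` are hypothetical and only serve the illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def path_to_url(file_path: str, path_prefix: str, base_url: str) -> str:
    """Strip `path_prefix` from `file_path` and append the rest to `base_url`.

    Raises ValueError if `file_path` does not lie below `path_prefix`.
    """
    relative = PurePosixPath(file_path).relative_to(path_prefix)
    # quote() percent-encodes spaces and similar characters but keeps "/".
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())


print(path_to_url("/srv/lectures/info2/slides 01.ppt",
                  "/srv/lectures",
                  "http://fileserver.example/lectures/"))
# http://fileserver.example/lectures/info2/slides%2001.ppt
```

Applied to every file found below the prefix, the same function could produce the URL list for the crawler, one URL per line.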