Apache Lucene web crawler

3/18/2023

The notes below summarize several Apache Nutch releases, one release per paragraph.

This release continues to provide Nutch users with a simplified Nutch distribution, building on the 2.x development drive, which is growing in popularity within the community.

This release is a maintenance release of the popular 1.5.X mainstream version of Nutch, which has been widely adopted within the community.

This release offers users an edition focused on large-scale crawling. It builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in-memory data store and various high-profile SQL stores.

This release upgrades several major components, including Tika 1.1 and Hadoop 1.0.0, improves the LinkRank and WebGraph elements, and adds a number of new plugins covering blacklisting, filtering and parsing, to name a few.

This release allows Parsers to declare support for multiple MIME types, and adds configurable Fetcher queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.

This release includes improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification, and a source release tarball that is an order of magnitude smaller (only about 2 MB).

This release includes several improvements (parse-html is again a selectable parser, and indexing is configurable per field), new features (timing information added to all Tool classes, and parser timeouts), and bug fixes (an NPE in distributed search, and XML formatting issues per Document fields).

This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes and speedups (e.g., to Fetcher2) have also been included.
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities have since been spun out into their own subproject, called Hadoop. In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation. In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl. While it was once a goal of the Nutch project to release a global large-scale web search engine, that is no longer the case.

Nutch has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. The fetcher ("robot" or "web crawler") was written from scratch specifically for this project.

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats.
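
To make the plug-in idea concrete, here is a minimal sketch of a registry in which one parser declares support for several MIME types. It is written in Python purely for illustration and is not the Nutch plug-in API: `ParserRegistry`, `register`, `parse` and `parse_plain_text` are all hypothetical names, and Nutch's real plug-in system is written in Java.

```python
from typing import Callable, Dict, Iterable

# Hypothetical names throughout: this is NOT the Nutch plug-in API.
ParserFn = Callable[[bytes], str]


class ParserRegistry:
    """Maps MIME types to parser callables, one parser per type."""

    def __init__(self) -> None:
        self._parsers: Dict[str, ParserFn] = {}

    def register(self, mime_types: Iterable[str], parser: ParserFn) -> None:
        # A single parser may declare support for several MIME types.
        for mime_type in mime_types:
            self._parsers[mime_type.lower()] = parser

    def parse(self, mime_type: str, content: bytes) -> str:
        # Look up the parser for this type; fail loudly if none is registered.
        parser = self._parsers.get(mime_type.lower())
        if parser is None:
            raise LookupError(f"no parser registered for {mime_type}")
        return parser(content)


def parse_plain_text(content: bytes) -> str:
    # Decode the fetched bytes and trim surrounding whitespace.
    return content.decode("utf-8", errors="replace").strip()


registry = ParserRegistry()
registry.register(["text/plain", "text/csv"], parse_plain_text)
```

A crawler built this way would call `registry.parse(mime_type, content)` on each fetched page, so supporting a new media type means registering another parser rather than changing the fetch loop.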
