18 changes: 18 additions & 0 deletions THIRD-PARTY.txt
@@ -30,6 +30,7 @@ List of third-party dependencies grouped by their license type.
* Apache Commons IO (commons-io:commons-io:2.21.0 - https://commons.apache.org/proper/commons-io/)
* Apache Commons Lang (org.apache.commons:commons-lang3:3.20.0 - https://commons.apache.org/proper/commons-lang/)
* Apache Commons Logging (commons-logging:commons-logging:1.2 - http://commons.apache.org/proper/commons-logging/)
* Apache Commons Logging (commons-logging:commons-logging:1.3.3 - https://commons.apache.org/proper/commons-logging/)
* Apache Commons Logging (commons-logging:commons-logging:1.3.6 - https://commons.apache.org/proper/commons-logging/)
* Apache Commons Math (org.apache.commons:commons-math3:3.6.1 - http://commons.apache.org/proper/commons-math/)
* Apache FontBox (org.apache.pdfbox:fontbox:3.0.7 - http://pdfbox.apache.org/)
@@ -51,7 +52,10 @@ List of third-party dependencies grouped by their license type.
* Apache HBase Unsafe Wrapper (org.apache.hbase.thirdparty:hbase-unsafe:4.1.12 - https://hbase.apache.org/hbase-unsafe)
* Apache HttpAsyncClient (org.apache.httpcomponents:httpasyncclient:4.1.5 - http://hc.apache.org/httpcomponents-asyncclient)
* Apache HttpClient (org.apache.httpcomponents:httpclient:4.5.14 - http://hc.apache.org/httpcomponents-client-ga)
* Apache HttpClient (org.apache.httpcomponents.client5:httpclient5:5.3.1 - https://hc.apache.org/httpcomponents-client-5.0.x/5.3.1/httpclient5/)
* Apache HttpClient Mime (org.apache.httpcomponents:httpmime:4.5.14 - http://hc.apache.org/httpcomponents-client-ga)
* Apache HttpComponents Core HTTP/1.1 (org.apache.httpcomponents.core5:httpcore5:5.2.5 - https://hc.apache.org/httpcomponents-core-5.2.x/5.2.5/httpcore5/)
* Apache HttpComponents Core HTTP/2 (org.apache.httpcomponents.core5:httpcore5-h2:5.2.5 - https://hc.apache.org/httpcomponents-core-5.2.x/5.2.5/httpcore5-h2/)
* Apache HttpCore (org.apache.httpcomponents:httpcore:4.4.16 - http://hc.apache.org/httpcomponents-core-ga)
* Apache HttpCore NIO (org.apache.httpcomponents:httpcore-nio:4.4.16 - http://hc.apache.org/httpcomponents-core-ga)
* Apache James :: Mime4j :: Core (org.apache.james:apache-mime4j-core:0.8.13 - http://james.apache.org/mime4j/apache-mime4j-core)
@@ -212,6 +216,7 @@ List of third-party dependencies grouped by their license type.
* opensearch-compress (org.opensearch:opensearch-compress:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-core (org.opensearch:opensearch-core:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-geo (org.opensearch:opensearch-geo:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* OpenSearch Java Client (org.opensearch.client:opensearch-java:2.13.0 - https://github.com/opensearch-project/opensearch-java/)
* opensearch-secure-sm (org.opensearch:opensearch-secure-sm:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-task-commons (org.opensearch:opensearch-task-commons:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* opensearch-telemetry (org.opensearch:opensearch-telemetry:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
@@ -248,6 +253,7 @@ List of third-party dependencies grouped by their license type.
* Playwright - Main Library (com.microsoft.playwright:playwright:1.58.0 - https://github.com/microsoft/playwright-java/playwright)
* proto-google-common-protos (com.google.api.grpc:proto-google-common-protos:2.59.2 - https://github.com/googleapis/sdk-platform-java)
* rank-eval (org.opensearch.plugin:rank-eval-client:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* rest (org.opensearch.client:opensearch-rest-client:2.12.0 - https://github.com/opensearch-project/OpenSearch.git)
* rest (org.opensearch.client:opensearch-rest-client:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* rest-high-level (org.opensearch.client:opensearch-rest-high-level-client:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* rome (com.rometools:rome:2.1.0 - http://rometools.com/rome)
@@ -256,6 +262,7 @@ List of third-party dependencies grouped by their license type.
* Shaded Deps for Storm Client (org.apache.storm:storm-shaded-deps:2.8.5 - https://storm.apache.org/storm-shaded-deps)
* SnakeYAML (org.yaml:snakeyaml:2.6 - https://bitbucket.org/snakeyaml/snakeyaml)
* snappy-java (org.xerial.snappy:snappy-java:1.1.10.4 - https://github.com/xerial/snappy-java)
* sniffer (org.opensearch.client:opensearch-rest-client-sniffer:2.12.0 - https://github.com/opensearch-project/OpenSearch.git)
* sniffer (org.opensearch.client:opensearch-rest-client-sniffer:2.19.5 - https://github.com/opensearch-project/OpenSearch.git)
* SparseBitSet (com.zaxxer:SparseBitSet:1.3 - https://github.com/brettwooldridge/SparseBitSet)
* storm-autocreds (org.apache.storm:storm-autocreds:2.8.5 - https://storm.apache.org/external/storm-autocreds)
@@ -344,6 +351,10 @@ List of third-party dependencies grouped by their license type.
* JAXB Runtime (org.glassfish.jaxb:jaxb-runtime:4.0.7 - https://eclipse-ee4j.github.io/jaxb-ri/)
* TXW2 Runtime (org.glassfish.jaxb:txw2:4.0.7 - https://eclipse-ee4j.github.io/jaxb-ri/)

Eclipse Distribution License v. 1.0, Eclipse Public License v. 2.0

* org.eclipse.yasson (org.eclipse:yasson:2.0.2 - https://projects.eclipse.org/projects/ee4j.yasson)

Eclipse Public License, Version 2.0, GPL-2.0-with-classpath-exception

* Jakarta RESTful WS API (jakarta.ws.rs:jakarta.ws.rs-api:3.1.0 - https://github.com/eclipse-ee4j/jaxrs-api)
@@ -352,6 +363,13 @@ List of third-party dependencies grouped by their license type.

* Jakarta Annotations API (jakarta.annotation:jakarta.annotation-api:1.3.5 - https://projects.eclipse.org/projects/ee4j.ca)

Eclipse Public License 2.0, GNU General Public License, version 2 with the GNU Classpath Exception

* Eclipse Parsson (org.eclipse.parsson:parsson:1.1.6 - https://github.com/eclipse-ee4j/parsson/parsson)
* Jakarta JSON Processing API (jakarta.json:jakarta.json-api:2.1.3 - https://github.com/eclipse-ee4j/jsonp)
* JSON-B API (jakarta.json.bind:jakarta.json.bind-api:2.0.0 - https://eclipse-ee4j.github.io/jsonb-api)
* JSON-P Default Provider (org.glassfish:jakarta.json:2.0.0 - https://github.com/eclipse-ee4j/jsonp)

GENERAL PUBLIC LICENSE, version 3 (GPL-3.0), GNU LESSER GENERAL PUBLIC LICENSE, version 3 (LGPL-3.0), Mozilla Public License Version 1.1

* juniversalchardet (com.github.albfernandez:juniversalchardet:2.5.0 - https://github.com/albfernandez/juniversalchardet)
70 changes: 70 additions & 0 deletions external/opensearch-java/README.md
@@ -0,0 +1,70 @@
stormcrawler-opensearch
===========================

A collection of resources for [OpenSearch](https://opensearch.org/):
* [IndexerBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java) for indexing documents crawled with StormCrawler
* [Spouts](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/AggregationSpout.java) and [StatusUpdaterBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java) for persisting URL information in recursive crawls
* [MetricsConsumer](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/metrics/MetricsConsumer.java)
* [StatusMetricsBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/metrics/StatusMetricsBolt.java) for sending the breakdown of URLs per status as metrics and displaying its evolution over time

as well as resources for building basic real-time monitoring dashboards for the crawls, see below.

This module is ported from the Elasticsearch one.

Getting started
---------------------

The easiest way is currently to use the archetype for OpenSearch with:

`mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-opensearch-archetype -DarchetypeVersion=3.4.0`

You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artifactId (e.g. stormcrawler), a version, a package name and details about the user agent to use.

This will create a fully formed project containing not only a POM with the dependency above but also a set of resources, configuration files and a topology class. Enter the directory you just created (it should have the same name as the artifactId you specified earlier) and follow the instructions in the README file.
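
Two of the user agent properties prompted for are checked against the `validationRegex` values declared in the archetype's _archetype-metadata.xml_. The sketch below only illustrates what those two patterns accept; the class name `AgentPropertyCheck` and the sample values are made up for the example, and treating the patterns as full-string matches is an assumption about how the archetype plugin applies them.

```java
import java.util.regex.Pattern;

// Illustration only: applies the two validationRegex values from
// archetype-metadata.xml to sample inputs. The class name and the sample
// values are hypothetical.
final class AgentPropertyCheck {

    // http-agent-name: letters, underscores and hyphens only
    // (no digits, spaces or dots).
    static final Pattern AGENT_NAME = Pattern.compile("^[a-zA-Z_\\-]+$");

    // http-agent-email: loose shape check, something@something.something
    // without any whitespace.
    static final Pattern AGENT_EMAIL = Pattern.compile("^\\S+@\\S+\\.\\S+$");

    private AgentPropertyCheck() {}

    static boolean isValidName(String value) {
        return AGENT_NAME.matcher(value).matches();
    }

    static boolean isValidEmail(String value) {
        return AGENT_EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("my-crawler"));       // true
        System.out.println(isValidName("my crawler 2"));     // false
        System.out.println(isValidEmail("bot@example.org")); // true
        System.out.println(isValidEmail("bot@example"));     // false
    }
}
```

In practice this means an agent name such as `crawler2` would be rejected because of the digit, so it is worth picking values that fit these patterns before running the archetype.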

You will of course need to have both Storm and OpenSearch installed. For the latter, the [OpenSearch documentation](https://opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/) contains resources for Docker.

Unlike in the Elastic module, the schemas are automatically created by the bolts. You can of course override them by using the script 'OS_IndexInit.sh' generated by the archetype; the index definitions are located in _src/main/resources_.


Dashboards
---------------------

To import the dashboards into a local instance of OpenSearch Dashboards, go into the folder _dashboards_ and run the script _importDashboards.sh_.

You should see something like

```
Importing status dashboard into OpenSearch Dashboards
{"successCount":4,"success":true,"successResults":[{"type":"index-pattern","id":"7445c390-7339-11e9-9289-ffa3ee6775e4","meta":{"title":"status","icon":"indexPatternApp"}},{"type":"visualization","id":"status-count","meta":{"title":"status count","icon":"visualizeApp"}},{"type":"visualization","id":"Top-Hosts","meta":{"title":"Top Hosts","icon":"visualizeApp"}},{"type":"dashboard","id":"Crawl-status","meta":{"title":"Crawl status","icon":"dashboardApp"}}]}
Importing metrics dashboard into OpenSearch Dashboards
{"successCount":9,"success":true,"successResults":[{"type":"index-pattern","id":"b5c3bbd0-7337-11e9-9289-ffa3ee6775e4","meta":{"title":"metrics","icon":"indexPatternApp"}},{"type":"visualization","id":"Fetcher-:-#-active-threads","meta":{"title":"Fetcher : # active threads","icon":"visualizeApp"}},{"type":"visualization","id":"Fetcher-:-num-queues","meta":{"title":"Fetcher : num queues","icon":"visualizeApp"}},{"type":"visualization","id":"Fetcher-:-pages-fetched","meta":{"title":"Fetcher : pages fetched","icon":"visualizeApp"}},{"type":"visualization","id":"Fetcher-:-URLs-waiting-in-queues","meta":{"title":"Fetcher : URLs waiting in queues","icon":"visualizeApp"}},{"type":"visualization","id":"Fetcher-:-average-bytes-per-second","meta":{"title":"Fetcher : average bytes per second","icon":"visualizeApp"}},{"type":"visualization","id":"Fetcher-:-average-pages-per-second","meta":{"title":"Fetcher : average pages per second","icon":"visualizeApp"}},{"type":"visualization","id":"Total-bytes-fetched","meta":{"title":"Total bytes fetched","icon":"visualizeApp"}},{"type":"dashboard","id":"Crawl-metrics","meta":{"title":"Crawl metrics","icon":"dashboardApp"}}]}
```

The [dashboard screen](http://localhost:5601/app/dashboards#/list?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))) should show both the status and metrics dashboards. If you click on `Crawl status`, you should see two tables containing the count of URLs per status and the top hostnames by URL count.
The [Metrics dashboard](http://localhost:5601/app/dashboards#/view/Crawl-metrics) can be used to monitor the progress of the crawl.

The file _storm.ndjson_ is used to display some of Storm's internal metrics and is not added by default.

#### Per time period metric indices (optional)

The _metrics_ index can be configured per time period. This best practice is [discussed on the Elastic website](https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html).

The crawler config YAML must be updated to use an optional argument as shown below to have one index per day:

```
#Metrics consumers:
topology.metrics.consumer.register:
  - class: "org.apache.stormcrawler.opensearch.metrics.MetricsConsumer"
    parallelism.hint: 1
    argument: "yyyy-MM-dd"
```








72 changes: 72 additions & 0 deletions external/opensearch-java/archetype/pom.xml
@@ -0,0 +1,72 @@
<?xml version="1.0" encoding="UTF-8"?>

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.apache.stormcrawler</groupId>
        <artifactId>stormcrawler</artifactId>
        <version>3.5.2-SNAPSHOT</version>
        <relativePath>../../../pom.xml</relativePath>
    </parent>

    <artifactId>stormcrawler-opensearch-java-archetype</artifactId>

    <packaging>maven-archetype</packaging>

    <build>

        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
                <includes>
                    <include>META-INF/maven/archetype-metadata.xml</include>
                </includes>
            </resource>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>false</filtering>
                <excludes>
                    <exclude>META-INF/maven/archetype-metadata.xml</exclude>
                </excludes>
            </resource>
        </resources>

        <extensions>
            <extension>
                <groupId>org.apache.maven.archetype</groupId>
                <artifactId>archetype-packaging</artifactId>
                <version>3.4.1</version>
            </extension>
        </extensions>

        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-archetype-plugin</artifactId>
                    <version>3.4.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
@@ -0,0 +1,21 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to you under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
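// Post-generation hook: make the generated helper scripts executable.
// setExecutable(true, false) sets the permission for all users, not only the owner.

// Script that imports the dashboards into OpenSearch Dashboards
def file1 = new File(request.getOutputDirectory(), request.getArtifactId() + "/dashboards/importDashboards.sh")
file1.setExecutable(true, false)

// Script that (re)creates the indices from the bundled definitions
def file2 = new File(request.getOutputDirectory(), request.getArtifactId() + "/OS_IndexInit.sh")
file2.setExecutable(true, false)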
@@ -0,0 +1,72 @@
<?xml version="1.0" encoding="UTF-8"?>

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<archetype-descriptor
    xmlns="https://maven.apache.org/plugins/maven-archetype-plugin/archetype-descriptor/1.1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://maven.apache.org/plugins/maven-archetype-plugin/archetype-descriptor/1.1.0 http://maven.apache.org/xsd/archetype-descriptor-1.1.0.xsd"
    name="stormcrawler-core">

    <requiredProperties>
        <requiredProperty key="http-agent-name">
            <validationRegex>^[a-zA-Z_\-]+$</validationRegex>
        </requiredProperty>
        <requiredProperty key="http-agent-version" />
        <requiredProperty key="http-agent-description" />
        <requiredProperty key="http-agent-url" />
        <requiredProperty key="http-agent-email">
            <validationRegex>^\S+@\S+\.\S+$</validationRegex>
        </requiredProperty>
        <requiredProperty key="StormCrawlerVersion">
            <defaultValue>${project.version}</defaultValue>
        </requiredProperty>
    </requiredProperties>

    <fileSets>
        <fileSet filtered="true" encoding="UTF-8">
            <directory>src/main/resources</directory>
            <includes>
                <include>**/*.xml</include>
                <include>**/*.txt</include>
                <include>**/*.yaml</include>
                <include>**/*.json</include>
                <include>**/*.mapping</include>
            </includes>
        </fileSet>
        <fileSet filtered="true" encoding="UTF-8">
            <directory></directory>
            <includes>
                <include>README.md</include>
                <include>*.flux</include>
                <include>*.yaml</include>
                <include>*.sh</include>
            </includes>
        </fileSet>
        <fileSet filtered="true" encoding="UTF-8">
            <directory>dashboards</directory>
            <includes>
                <include>*.sh</include>
                <include>*.ndjson</include>
            </includes>
        </fileSet>
    </fileSets>

</archetype-descriptor>