1. Overview
The Search & Crawl tool is a powerful utility for interacting with web content. It combines a robust Google search function with direct web page crawling capabilities, allowing you to find information and extract it directly from the source.
With the GoInsight Search & Crawl node, you can integrate dynamic web data into your workflows. This enables a wide range of automated tasks, including:
- Performing Google Searches: Execute single or batch search queries to retrieve structured search results.
- Deep Content Extraction: Go beyond search snippets by enabling deep crawling to analyze and extract the most relevant content from result pages.
- Direct Web Page Crawling: Scrape the text content from one or multiple specific URLs.
- AI-Powered Summarization: Automatically generate concise summaries of crawled web page content in your desired language.
2. Prerequisites
This tool is generally ready for use without requiring a specific third-party account. However, usage may be subject to fair-use policies and rate limits to ensure service stability. No special permissions are needed to use this node.
3. Credentials
For instructions on obtaining and configuring credentials, please refer to our official documentation: Credentials Configuration Guide.
4. Supported Operations
This node provides operations centered around searching the web and crawling web pages.
Summary
| Resource | Operation | Description |
| --- | --- | --- |
| Search | Google Search | Performs Google searches with support for batch keyword queries, deep crawling of search result pages, and formatted output. The tool retrieves structured search results or, when deep crawling is enabled, extracts detailed content from the most relevant pages, providing richer insights than standard search snippets. |
| Web Page | Web Crawl | Extracts content from a single web page and optionally generates an AI-powered summary for quick insights. |
| Web Page | Web Crawl Batch | Extracts content from one or multiple web pages and optionally generates AI-powered summaries for quick insights. |
Operation Details
Google Search
Performs Google searches with support for batch keyword queries, deep crawling of search result pages, and formatted output. The tool retrieves structured search results or, when deep crawling is enabled, extracts detailed content from the most relevant pages, providing richer insights than standard search snippets.
Input Parameters:
- SearchQueries: Comma-separated search phrases (e.g. "Super Bowl commercials,NBA playoff schedule")
Options:
- MaxQueryGroups: Maximum number of query groups to process (1-5, default=3)
- SearchResultCount: Number of search results to return (range: 10-100)
- SearchLocale: Locale used to display search results for a specific language and region (e.g., en-US, en-GB, id-ID).
- DeepCrawl: Enable intelligent page analysis to extract the most relevant content from search results; when true, only Markdown summaries of crawled pages (no standard search lists) will be returned.
- MaxCrawlPages: Maximum number of pages to deep crawl (range: 1-10, default=5)
- OutputAsMarkdown: Output results as Markdown (true) or as a JSON string (false).
Output:
- SearchResults (string): JSON string containing search results structured as [{"query": "search phrase", "organic_results": [{"title": "Result Title", "snippet": "Summary text", "link": "https://example.com"}]}]. When OutputAsMarkdown=true, results are formatted as human-readable Markdown with headers, links, and dividers.
- StatusCode (number): HTTP status code reflecting the success or failure of the search request (e.g., 200 for success, 403 for forbidden or blocked requests, 500 for internal errors). Directly corresponds to the underlying service’s response status.
- ErrorMessage (string): Detailed error description if the search request fails. Empty if the request succeeds. Common examples: "Invalid search query provided", "Rate limit exceeded", "Service unavailable".
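As a minimal sketch, the TypeScript below parses the SearchResults string (with OutputAsMarkdown=false) and collects the result links, for example to feed into Web Crawl Batch. The interfaces simply mirror the structure shown above and are illustrative, not a formal schema.

```typescript
// Illustrative shape of the documented SearchResults JSON (not a strict contract).
interface OrganicResult {
  title: string;
  snippet: string;
  link: string;
}

interface QueryGroup {
  query: string;
  organic_results: OrganicResult[];
}

// Parse the SearchResults string and flatten every result link across all query groups.
function extractLinks(searchResults: string): string[] {
  const groups: QueryGroup[] = JSON.parse(searchResults);
  return groups.flatMap((group) => group.organic_results.map((r) => r.link));
}

// Example with a hypothetical output string:
const sample =
  '[{"query":"Super Bowl commercials","organic_results":[{"title":"Result Title","snippet":"Summary text","link":"https://example.com"}]}]';
console.log(extractLinks(sample)); // ["https://example.com"]
```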
Web Crawl
Extracts content from a single web page and optionally generates an AI-powered summary for quick insights.
Input Parameters:
- WebPageLink: Single URL of the web page to crawl. Example: "https://example.com".
Options:
- UseAISummarization: Set to true to use AI to generate summaries of scraped content; false to return raw data only.
- SummaryLanguage: Specifies the target language for AI-generated summaries. Defaults to "English". Use full language names (e.g., "Chinese", "Spanish", "Japanese").
Output:
- Title (string): Extracted title of the crawled web page. Returns an empty string if no title is found (e.g., for error pages or invalid URLs).
- Content (string): Cleaned text content of the web page. If UseAISummarization is enabled, this field contains the AI-generated summary instead of raw text.
- StatusCode (number): HTTP status code of the request (e.g., 200 for success, 404 for not found, 500 for server error). Directly maps to the server’s response status.
- ErrorMessage (string): Detailed error description if the request fails. Remains empty if the crawl is successful. Examples: "Invalid URL provided", "Connection timeout", "Server error".
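As a rough illustration, a downstream step might check StatusCode and ErrorMessage before using Content. The interface and helper below are assumptions for the sketch, not part of the node itself; the field names match the Output list above.

```typescript
// Assumed shape of the Web Crawl output fields documented above.
interface WebCrawlOutput {
  Title: string;
  Content: string;
  StatusCode: number;
  ErrorMessage: string;
}

function getUsableContent(result: WebCrawlOutput): string | null {
  // Treat a non-2xx status or a non-empty ErrorMessage as a failed crawl.
  if (result.StatusCode < 200 || result.StatusCode >= 300 || result.ErrorMessage !== "") {
    console.warn(`Crawl failed (${result.StatusCode}): ${result.ErrorMessage}`);
    return null;
  }
  // Raw page text, or the AI summary if UseAISummarization was enabled.
  return result.Content;
}
```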
Web Crawl Batch
Extracts content from one or multiple web pages and optionally generates AI-powered summaries for quick insights.
Input Parameters:
- WebPageLinks: Comma-separated list of web page URLs to scrape. Example: "https://example.com,https://another-site.com".
Options:
- MaxCrawlPages: Maximum number of pages to scrape per URL. Range: 1-10, Default: 5. Adjust this parameter to control crawling depth.
- UseAISummarization: Set to true to use AI to generate summaries of scraped content; false to return raw data only.
- SummaryLanguage: Specifies the target language for AI-generated summaries. Defaults to "English". Use full language names (e.g., "Chinese", "Spanish", "Japanese").
Output:
- CrawlResultsJson (string): JSON string containing crawl results structured as [{"url": "crawled URL", "title": "page title", "content": "cleaned text (AI summary if UseAISummarization=true)", "status_code": HTTP code, "error_message": "error details"}].
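A hedged TypeScript sketch of parsing CrawlResultsJson into a typed array and keeping only pages that crawled cleanly; the field names follow the structure shown above and the filter criteria are an assumption.

```typescript
// Illustrative shape of one entry in the documented CrawlResultsJson array.
interface CrawlResult {
  url: string;
  title: string;
  content: string;
  status_code: number;
  error_message: string;
}

// Parse the batch output and keep only successfully crawled pages.
function successfulCrawls(crawlResultsJson: string): CrawlResult[] {
  const results: CrawlResult[] = JSON.parse(crawlResultsJson);
  return results.filter((r) => r.status_code === 200 && r.error_message === "");
}
```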
5. Example Usage
This section will guide you through creating a simple workflow to search Google for a specific topic and retrieve the results.
The workflow will consist of three nodes: Start -> Google Search -> Answer.
- Add the Tool Node
- In the workflow canvas, click the "+" icon to add a new node.
- Select the "Tools" tab in the pop-up panel.
- Find and select "Search_&_crawl" from the list of tools.
- In the list of supported operations for Search_&_crawl, click on "Google Search" to add the node to your canvas.
- Configure the Node
- Click on the newly added "Google Search" node to open its configuration panel on the right.
- Credentials Configuration: This tool does not require special credentials. You can leave this field as is.
- Parameter Configuration: Fill in the input parameters to define your search.
- SearchQueries: Enter the topic you want to search for. For instance, to find recent news about artificial intelligence, you could type "latest advancements in AI".
- SearchResultCount (Optional): To limit the number of results, you can set this to a value like 10.
- OutputAsMarkdown (Optional): Leave this as the default true for a nicely formatted, human-readable output.
- Run and Validate
- Once all required parameters are correctly filled, any error indicators on the workflow canvas will disappear.
- Click the "Run" button in the top-right corner of the canvas to execute the workflow.
- After a successful run, you can click the log icon in the top-right corner to view the detailed inputs and outputs of the node, verifying that the search was performed correctly and results were returned.
After completing these steps, your workflow is fully configured. When you run it, the node will perform a Google search based on your query and output the results, which can then be used by other nodes in your workflow.
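For reference, the parameter set from this walkthrough can be summarized as a plain object; the parameter names match Section 4, and the values are examples only, not defaults.

```typescript
// Example configuration for the "Google Search" node used in this walkthrough.
const googleSearchParams = {
  SearchQueries: "latest advancements in AI", // required: the topic to search
  SearchResultCount: 10,                      // optional: limit the number of results
  OutputAsMarkdown: true,                     // optional: human-readable Markdown output
};

console.log(JSON.stringify(googleSearchParams, null, 2));
```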
6. FAQs
Q: Why did my search or crawl request fail with a 403 or 500 status code?
A: These errors can occur for several reasons.
- Rate Limiting: You may have made too many requests in a short period.
- Website Blocking: The target website may have security measures in place to block automated scraping or crawling.
- Service Issues: There might be a temporary issue with the search provider or the target website. Try the request again after a short wait.
Q: The crawled content is empty or missing information I see in my browser. Why?
A: This can happen if a website relies heavily on JavaScript to load its content dynamically. The basic crawler fetches the initial HTML of the page and may not execute the JavaScript needed to render the full content.
Q: How can I use the JSON output from Google Search or Web Crawl Batch in other nodes?
A: The JSON output is returned as a single string. To access the data within it, you can connect the output to a Code node. Inside the Code node, you can use standard functions (e.g., JSON.parse() in JavaScript) to convert the string into a structured object, allowing you to easily access specific fields like titles, links, and snippets for use in subsequent steps of your workflow.
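As a sketch, a defensive parse helper for a Code node might look like the TypeScript below. It falls back to an empty array when the string is not valid JSON (for example, when OutputAsMarkdown was left enabled); the generic shape and names are assumptions for illustration.

```typescript
// Defensively convert a node's JSON output string into a typed array.
// If the upstream node returned Markdown instead of JSON, JSON.parse throws
// and we fall back to an empty array.
function parseResults<T>(jsonString: string): T[] {
  try {
    const parsed = JSON.parse(jsonString);
    return Array.isArray(parsed) ? (parsed as T[]) : [];
  } catch {
    // Not valid JSON: likely Markdown output or an error payload.
    return [];
  }
}

// Example: pull result links out of a SearchResults string once parsed.
type SearchGroup = { query: string; organic_results: Array<{ link: string }> };
const links = parseResults<SearchGroup>("[]").flatMap((g) =>
  g.organic_results.map((r) => r.link)
);
console.log(links); // [] for the empty sample string
```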
7. Official Documentation
For more advanced usage and a deeper understanding of web scraping principles, you can refer to general web development documentation and best practices.