
Apify Discord Mirror

Hello, I'm trying to handle an AJAX call via got-scraping. I prepared the call in Postman, where it works fine, but when I try it in an Actor I get a 403 every time. Even if I try it via Puppeteer or Playwright and click the button that fires the request, I get a response with a geo.captcha-delivery.com/captcha URL to solve.
Can anybody give me any advice on how to handle this issue?
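The geo.captcha-delivery.com URL suggests a DataDome challenge, so a plain HTTP client usually needs browser-consistent headers plus a good proxy. A minimal got-scraping sketch along those lines — the endpoint and proxy URL below are placeholders, not from the original post:
Plain Text
import { gotScraping } from 'got-scraping';

// Placeholder endpoint and proxy URL -- substitute your own.
const response = await gotScraping({
    url: 'https://example.com/ajax-endpoint',
    // Residential proxies tend to fare better against DataDome than datacenter IPs.
    proxyUrl: 'http://username:password@proxy.example.com:8000',
    // Let got-scraping generate a consistent browser-like header set.
    headerGeneratorOptions: {
        browsers: ['chrome'],
        devices: ['desktop'],
        locales: ['en-US'],
    },
});
console.log(response.statusCode);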
1 comment
L
Hello great friends of Crawlee,
I was wondering if there is any way to use Camoufox with the adaptive Playwright crawler?

It seems to throw an error when I try to add the browser pool.
2 comments
m
M
My team and I saw online that if we publish our scraper as an Actor on Apify's platform, we could get an Apify hoodie. Is that true?
1 comment
S
I implemented a Playwright crawler to parse URLs. I made a single request to the crawler with a first URL, and while that request was still processing I passed a second URL to the crawler and made another request. Both times the crawler processed content from the first URL instead of the second one. Can you please help?


from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def run_crawler(url, domain_name, save_path=None):
    print("doc url inside crawler file====================================>", url)
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {url} ...')

        # Collect links that stay on the same domain.
        links = await context.page.evaluate(f'''() => {{
            return Array.from(document.querySelectorAll('a[href*="{domain_name}"]'))
                .map(a => a.href);
        }}''')

        await context.enqueue_links(urls=links)

        # PW_SCRAPING_CODE is a JS snippet defined elsewhere in the project.
        elements = await context.page.evaluate(PW_SCRAPING_CODE)

        data = {
            'url': url,
            'title': await context.page.title(),
            'content': elements,
        }
        print("data =================>", data)

        await context.push_data(data)

    await crawler.run([url])
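A likely cause of the behavior described above: request_handler closes over the url argument, so both the log line and the 'url' field always report whatever URL run_crawler was first called with, no matter which page is actually being processed. The URL of the current request is available on the crawling context. Note too that separate run_crawler calls in one process share the default request queue, so already-handled requests may be deduplicated. A minimal sketch of the handler change:
Plain Text
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Use the URL of the request being processed,
    # not the argument captured by the closure.
    current_url = context.request.url
    context.log.info(f'Processing {current_url} ...')

    await context.push_data({
        'url': current_url,
        'title': await context.page.title(),
    })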


I am calling the crawler using...
Here is my Selenium Python script, where I try to rotate proxies using proxy_config.new_url():
Plain Text
# Standard libraries
import asyncio
import logging
import json

# Installed libraries
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.proxy import ProxyType, Proxy
from selenium.webdriver.common.by import By
from selenium import webdriver
from apify import Actor

async def main() -> None:
    async with Actor:
        Actor.log.setLevel(logging.DEBUG)
        proxy_config = await Actor.create_proxy_configuration(groups=['RESIDENTIAL'])
        url = "https://api.ipify.org?format=json"
        for _ in range(10):
            proxy = await proxy_config.new_url()
            Actor.log.info(f'Using proxy: {proxy}')
            chrome_options = ChromeOptions()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.proxy = Proxy({'proxyType': ProxyType.MANUAL, 'httpProxy': proxy})
            try:
                with webdriver.Chrome(options=chrome_options) as driver:
                    driver.set_page_load_timeout(20)
                    driver.get(url)
                    content = driver.find_element(By.TAG_NAME, 'pre').text
                    ip = json.loads(content).get("ip")
                    Actor.log.info(f"IP = {ip}")
            except (TimeoutException, WebDriverException, json.JSONDecodeError):
                Actor.log.exception("An error occurred")

Due to the Discord message size limit, I attach the log output of the above code in a new message below...
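One thing worth checking before digging into the log: Apify proxy URLs embed credentials (http://user:pass@host:port), and Chrome ignores credentials supplied through the MANUAL proxy capability, so every request may be failing proxy authentication. A common workaround is the third-party selenium-wire package, which does support authenticated proxies. A sketch of how the loop body above could use it — selenium-wire is an assumption, not part of the original code:
Plain Text
# pip install selenium-wire
from seleniumwire import webdriver as wire_webdriver

# Inside the rotation loop, replacing the plain webdriver.Chrome(...) call:
proxy = await proxy_config.new_url()  # e.g. http://user:pass@proxy.apify.com:8000
seleniumwire_options = {
    'proxy': {
        'http': proxy,
        'https': proxy,
    },
}
with wire_webdriver.Chrome(
    options=chrome_options,
    seleniumwire_options=seleniumwire_options,
) as driver:
    driver.get(url)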
2 comments
h
Hi all, what is your experience with RESIDENTIAL proxies?

Let us share:
  • provider URL
  • price /GB residential traffic
  • their advantages/disadvantages
My experience:
iproyal.com, "royal-residential-proxies": $5.51 per GB with the "Pay As You Go" option; I paid $66.15 for 12 GB.

These are good proxies, everything works.
But expensive.
Recently I've been seeing that the gigabytes I bought are running out too fast.
1 comment
n
Hi, I recently signed up to Apify for the data I required, and it was brilliant last week. I have the $39 package and I was able to scrape all 65,000 locations I needed, in a short period of time and with all the info required. But due to the new pricing policy, this same data will cost me over $800! I understand an increase, and I'm all for that, but this is huge!! Is there no middle ground? I'm happy to wait for the data to scrape; I don't need it in seconds or even minutes. It took overnight to gather the 65,000 results, and that was perfectly acceptable.
Reclaiming failed request back to the list or queue. Resource http://www.etmoc.com/look/Looklist?Id=47463 served with unsupported charset/encoding: ISO-88509-1
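The charset in that error (ISO-88509-1) looks like a typo on the site's side for ISO-8859-1, which would explain why Crawlee's encoding detection gives up. Crawlee's HTTP-based crawlers let you override detection; a sketch, assuming the intended charset really is ISO-8859-1:
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The server declares a bogus charset, so skip detection and
    // decode every response as ISO-8859-1.
    forceResponseEncoding: 'iso-8859-1',
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});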
4 comments
a
i
Hi,

How to fix this?

Failed to compile

./node_modules/.pnpm/@crawlee+puppeteer@3.13.0_playwright@1.50.1/node_modules/@crawlee/puppeteer/internals/utils/puppeteer_utils.js:224:22
Module not found: Can't resolve 'puppeteer/package.json'

  222 |     return client.send(command, ...args);
  223 | }
> 224 | const jsonPath = require.resolve('puppeteer/package.json');
      |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  225 | const parsed = JSON.parse(await (0, promises_1.readFile)(jsonPath, 'utf-8'));
  226 | throw new Error(`Cannot detect CDP client for Puppeteer ${parsed.version}. You should report this to Crawlee, mentioning the puppeteer version you are using.`);
  227 | }

https://nextjs.org/docs/messages/module-not-found
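This usually means the Next.js build is trying to bundle @crawlee/puppeteer with webpack, where require.resolve('puppeteer/package.json') cannot work. Two things that commonly help: make sure puppeteer is installed as a direct dependency, and keep crawlee out of the bundle via Next's external-packages option. A sketch — the option is experimental.serverComponentsExternalPackages on Next 13/14; newer versions renamed it serverExternalPackages:
Plain Text
// next.config.js
module.exports = {
    experimental: {
        // Keep these packages out of the webpack bundle; they are
        // loaded at runtime on the server instead.
        serverComponentsExternalPackages: ['crawlee', '@crawlee/puppeteer', 'puppeteer'],
    },
};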
3 comments
M
a
But the Apify API sometimes doesn't work correctly;
I mean, sometimes the result is zero.

What is the reason?
My guess is rate limiting, but I'm not sure.

Please help me out, thank you
1 comment
R
Anyone know how to fix this error? It works perfectly locally, but not on the Apify platform. See the attached file with the full error log.

Code snippet:
Plain Text
import { PlaywrightCrawler, Dataset } from 'crawlee';
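// playwrightCookies and waitForSelector are defined earlier in the Actor's
// input handling; this snippet is an excerpt.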

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request, log }) => {
            if (playwrightCookies.length > 0) {
                log.info(`Setting ${playwrightCookies.length} cookies for ${request.url}...`);
                await page.context().addCookies(playwrightCookies);
            }
        },
    ],
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}...`);

        const startTime = Date.now();

        if (waitForSelector) {
            try {
                await page.waitForSelector(waitForSelector, { timeout: 60000 });
            } catch (error) {
                log.info(`Selector "${waitForSelector}" not detected after 1 minute. Continuing...`);
            }
        }
1 comment
L
Hey guys! I have a problem where, when I add a new input field to an Actor, the tasks associated with that Actor don't get updated with that field.

So does that mean I have to recreate the tasks from scratch?? :((
2 comments
V
A
I'm using the Run Actor API with webhooks on the Actor run. I get the eventType ACTOR.RUN.SUCCEEDED, and that's okay, but I'd like to know what the response will look like when the eventType is ACTOR.RUN.FAILED. Can someone help me with this?
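As far as I know from the Apify webhook docs, the default payload has the same shape for both events; only eventType and the run's status (and exit code) differ. Roughly like this, with placeholder IDs, and with resource holding the full run object (trimmed here):
Plain Text
{
    "userId": "abc123",
    "createdAt": "2024-01-01T00:00:00.000Z",
    "eventType": "ACTOR.RUN.FAILED",
    "eventData": {
        "actorId": "def456",
        "actorRunId": "ghi789"
    },
    "resource": {
        "id": "ghi789",
        "actId": "def456",
        "status": "FAILED",
        "exitCode": 1
    }
}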
X
Xeno

Chrome Path

Hello, can you let me know what the path for Chrome is?
5 comments
A
X
React JS

WARNING in ./node_modules/apify-client/dist/resource_clients/user.js
Module Warning (from ./node_modules/source-map-loader/dist/cjs.js):
Failed to parse source map from '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/user.ts' file: Error: ENOENT: no such file or directory, open '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/user.ts'

WARNING in ./node_modules/apify-client/dist/resource_clients/webhook.js
Module Warning (from ./node_modules/source-map-loader/dist/cjs.js):
Failed to parse source map from '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook.ts' file: Error: ENOENT: no such file or directory, open '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook.ts'

WARNING in ./node_modules/apify-client/dist/resource_clients/webhook_collection.js
Module Warning (from ./node_modules/source-map-loader/dist/cjs.js):
Failed to parse source map from '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook_collection.ts' file: Error: ENOENT: no such file or directory, open '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook_collection.ts'
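These warnings come from source-map-loader looking for apify-client's original .ts sources, which the published package doesn't ship; they're harmless. With Create React App you can set GENERATE_SOURCEMAP=false, or in a custom webpack 5 config suppress them explicitly (a sketch):
Plain Text
// webpack.config.js (webpack 5) -- silence the missing-source-map noise.
module.exports = {
    // ...existing config...
    ignoreWarnings: [/Failed to parse source map/],
};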
This is probably a simple fix, but I cannot find an example of Crawlee using a simple proxy link with Playwright. If anyone has a working example or knows what is wrong in the code, I would really appreciate your help. Here is the code I have been working with:

(I wish I could copy and paste the code here, but the post would go over the character limit.)

I get the following error from the code:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jr/Desktop/Pasos_webscraping/.venv/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_CERT_AUTHORITY_INVALID at https://www.instagram.com/p/DGWPnK1S0K2/
Call log:
Any help on how to proceed would be greatly appreciated!
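net::ERR_CERT_AUTHORITY_INVALID usually means the proxy is intercepting TLS with its own certificate, so the browser has to be told to accept it. A minimal plain-Playwright sketch (the proxy server and credentials are placeholders):
Plain Text
# A minimal Playwright sketch; the proxy details are placeholders.
from playwright.async_api import async_playwright

async def fetch(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={'server': 'http://proxy.example.com:8000',
                   'username': 'user', 'password': 'pass'},
        )
        # Accept the proxy's self-signed/MITM certificate.
        context = await browser.new_context(ignore_https_errors=True)
        page = await context.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html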
1 comment
M
During my Apify scraping runs with Crawlee/Puppeteer (32 GB RAM per run), my jobs stop with "There was an uncaught exception during the run of the Actor and it was not handled", plus the logs you see in the screenshot at the end.
This often happens on runs longer than 30 minutes; runs under 30 minutes are less likely to hit the error.
I've tried increasing the 'protocolTimeout' setting, but the error still happens, just after a longer wait.
I've also tried different concurrency settings, even leaving the defaults, but I consistently see this error.

Plain Text
const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                "--no-sandbox", // Mitigates the "sandboxed" process issue in Docker containers,
                "--ignore-certificate-errors",
                "--disable-dev-shm-usage",
                "--disable-infobars",
                "--disable-extensions",
                "--disable-setuid-sandbox",
                "--ignore-certificate-errors",
                "--disable-gpu", // Mitigates the "crashing GPU process" issue in Docker containers
            ],
        },
    },
    maxRequestRetries: 1,
    navigationTimeoutSecs: 60,
    autoscaledPoolOptions: { minConcurrency: 30 },
    maxSessionRotations: 5,
    preNavigationHooks: [
        async ({ blockRequests }, goToOptions) => {
            if (goToOptions) goToOptions.waitUntil = "domcontentloaded"; // Set waitUntil here
            await blockRequests({
                urlPatterns: [
...
                ],
            });
        },
    ],
    proxyConfiguration,
    requestHandler: router,
});
await crawler.run(startUrls);
await Actor.exit();
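One thing that stands out in the snippet: autoscaledPoolOptions: { minConcurrency: 30 } forces at least 30 concurrent pages regardless of memory pressure, and the Crawlee docs warn that setting minConcurrency too high for the available memory can make runs slow or crash — which would fit the long-run failures. A sketch of a gentler configuration:
Plain Text
const crawler = new PuppeteerCrawler({
    // Start low and let the autoscaler grow to a hard ceiling
    // instead of forcing 30 concurrent pages from the start.
    autoscaledPoolOptions: { minConcurrency: 5, maxConcurrency: 30 },
    // ...rest of the options as above...
});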
1 comment
O
Hi! Error with Lodash in Crawlee

Please help. I ran the actor and got this error. I tried changing to different versions of Crawlee, but the error still persists.

node:internal/modules/cjs/loader:1140
const err = new Error(message);
^

Error: Cannot find module './_baseGet'
Require stack:

  • C:\wedat\dat-spain\apps\actor\node_modules\lodash\get.js
  • C:\wedat\dat-spain\apps\actor\node_modules\@sapphire\shapeshift\dist\cjs\index.cjs
  • C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\memory-storage.js
  • C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\index.js
  • C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\core\configuration.js
4 comments
A
O
Hi!

I'm new to Crawlee, I'm super excited to migrate my scraping architecture to Crawlee but I can't find how to achieve this.

My use case:
I'm scraping 100 websites multiple times a day. I'd like to save the working configurations (cookies, headers, proxy) for each site.

From what I understand, Sessions are made for this.
However, I'd like to have the working Sessions in my database: that way, working sessions persist even if the script shuts down...

Also, saving the working configurations in a database would be useful when scaling Crawlee to multiple server instances.

My ideal scenario would be to save all the configurations for each site (including the type of crawler used (Cheerio, got, Playwright), CSS selectors, proxy needs, headers, cookies...)

Thanks a lot for your help!
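As far as I know, Crawlee doesn't ship a database-backed session store out of the box, but you can approximate one with hooks: load known-good cookies for a site from your database before navigation and write them back after a successful request. A rough sketch — loadConfig and saveConfig are hypothetical helpers backed by your database:
Plain Text
import { CheerioCrawler } from 'crawlee';

// loadConfig/saveConfig are hypothetical helpers backed by your database.
const crawler = new CheerioCrawler({
    useSessionPool: true,
    preNavigationHooks: [
        async ({ request, session }) => {
            // Restore a previously working configuration for this site.
            const config = await loadConfig(new URL(request.url).hostname);
            if (config?.cookies) session.setCookies(config.cookies, request.url);
        },
    ],
    async requestHandler({ request, session, $ }) {
        // On success, persist whatever worked for this site.
        await saveConfig(new URL(request.url).hostname, {
            cookies: session.getCookies(request.url),
        });
    },
});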
3 comments
O
F
Using our own developed PPE actors causes us to appear as paid users on the analytics dashboard. However, using our own PPR and rented actors does not reflect as a paying user. This issue with PPE actors can be confusing for developers, and since there is no actual profit/cost change, it may appear as if the actor has issues with charging.

Additionally, having more detailed indicators for PPE actors in the analytics dashboard would be very beneficial. For example, it would be great to see how much each event is charged per execution for each actor.
Hi, we are trying to upgrade to a paid solution and we can't get the payment through. We checked the billing details and contacted the card company, and there were no issues on their end. They said there was no payment attempt from Apify. Can you please assist with this issue?
14 comments
O
c
A
I am running a Twitter Scraper Actor v2 on Apify, and I see that my run succeeded and says 100 results,
but when I go to the details page, it is just an array of 100 items of {'demo': true}.
How can I get proper details?
1 comment
O
❗ Guys, was something recently released or changed at Apify related to Actor resources, etc.? I have an Actor that has been running fine for a while, but in the past few days migrations have become frequent, causing issues for some of my paid Actor users. ⚠️
1 comment
O
A parameter name containing a dot (.) with the stringList editor doesn't work in the web console.

Example INPUT_SCHEMA.JSON
Plain Text
{
    "title": "Test",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "search.location": {"title": "Locations #1", "type": "array", "description": "", "editor": "stringList", "prefill": ["Bandung"]},  ### <-- Problem
        "search_location": {"title": "Locations #2", "type": "array", "description": "", "editor": "stringList", "prefill": ["Bandung"]}
    }
}

Check Actor ID: acfF0psV9y4e9Z4hq.
I can't click the +Add button. When edited using the Bulk button, the resulting JSON is weird: it automatically becomes an object structure, which is a nice effect. Not sure if this is really a bug or a new feature?
2 comments
O
!
I want an Apify Actor that takes a location name as input and returns the LinkedIn geolocation ID as output. Is there any such Actor available in the Apify Store, or on any platform in general?
2 comments
O
k