
Apify Discord Mirror

Hello, I'm trying to handle an AJAX call via got-scraping. I prepared the call in Postman, where it works fine, but when I try it in an Actor I get a 403 every time. Even if I try it via Puppeteer or Playwright and click the button that fires the request, I get a response with a geo.captcha-delivery.com/captcha URL to solve.
Can anybody give me any advice on how to handle this issue?
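The geo.captcha-delivery.com URL suggests a DataDome challenge, so a plain HTTP client usually needs browser-consistent headers plus a good proxy. A minimal got-scraping sketch along those lines — the endpoint and proxy URL below are placeholders, not from the original post:
Plain Text
import { gotScraping } from 'got-scraping';

// Placeholder endpoint and proxy URL -- substitute your own.
const response = await gotScraping({
    url: 'https://example.com/ajax-endpoint',
    // Residential proxies tend to fare better against DataDome than datacenter IPs.
    proxyUrl: 'http://username:password@proxy.example.com:8000',
    // Let got-scraping generate a consistent browser-like header set.
    headerGeneratorOptions: {
        browsers: ['chrome'],
        devices: ['desktop'],
        locales: ['en-US'],
    },
});
console.log(response.statusCode);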
1 comment
L
Hello great friends of Crawlee,
I was wondering if there is any way to use Camoufox with the adaptive Playwright crawler?

It seems to throw an error when I try to add the browser pool.
2 comments
m
M
My team and I saw online that if we publish our scraper as an Actor on Apify's platform, we could get an Apify hoodie. Is that true?
1 comment
S
I implemented a Playwright crawler to parse URLs. I made a single request to the crawler with a first URL, and while that request was still processing I passed a second URL to the crawler and made another request. Both times the crawler processed content from the first URL instead of the second one. Can you please help?


from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def run_crawler(url, domain_name, save_path=None):
    print("doc url inside crawler file====================================>", url)
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {url} ...')

        # Collect links that stay on the same domain.
        links = await context.page.evaluate(f'''() => {{
            return Array.from(document.querySelectorAll('a[href*="{domain_name}"]'))
                .map(a => a.href);
        }}''')

        await context.enqueue_links(urls=links)

        # PW_SCRAPING_CODE is a JS snippet defined elsewhere in the project.
        elements = await context.page.evaluate(PW_SCRAPING_CODE)

        data = {
            'url': url,
            'title': await context.page.title(),
            'content': elements,
        }
        print("data =================>", data)

        await context.push_data(data)

    await crawler.run([url])
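A likely cause of the behavior described above: request_handler closes over the url argument, so both the log line and the 'url' field always report whatever URL run_crawler was first called with, no matter which page is actually being processed. The URL of the current request is available on the crawling context. Note too that separate run_crawler calls in one process share the default request queue, so already-handled requests may be deduplicated. A minimal sketch of the handler change:
Plain Text
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Use the URL of the request being processed,
    # not the argument captured by the closure.
    current_url = context.request.url
    context.log.info(f'Processing {current_url} ...')

    await context.push_data({
        'url': current_url,
        'title': await context.page.title(),
    })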


I am calling the crawler using...
Here is my Selenium Python script, where I try to rotate proxies using proxy_config.new_url():
Plain Text
# Standard libraries
import asyncio
import logging
import json

# Installed libraries
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.proxy import ProxyType, Proxy
from selenium.webdriver.common.by import By
from selenium import webdriver
from apify import Actor

async def main() -> None:
    async with Actor:
        Actor.log.setLevel(logging.DEBUG)
        proxy_config = await Actor.create_proxy_configuration(groups=['RESIDENTIAL'])
        url = "https://api.ipify.org?format=json"
        for _ in range(10):
            proxy = await proxy_config.new_url()
            Actor.log.info(f'Using proxy: {proxy}')
            chrome_options = ChromeOptions()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.proxy = Proxy({'proxyType': ProxyType.MANUAL, 'httpProxy': proxy})
            try:
                with webdriver.Chrome(options=chrome_options) as driver:
                    driver.set_page_load_timeout(20)
                    driver.get(url)
                    content = driver.find_element(By.TAG_NAME, 'pre').text
                    ip = json.loads(content).get("ip")
                    Actor.log.info(f"IP = {ip}")
            except (TimeoutException, WebDriverException, json.JSONDecodeError):
                Actor.log.exception("An error occurred")

Due to the Discord message size limit, I attach the log output of the above code in a new message below...
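One thing worth checking before digging into the log: Apify proxy URLs embed credentials (http://user:pass@host:port), and Chrome ignores credentials supplied through the MANUAL proxy capability, so every request may be failing proxy authentication. A common workaround is the third-party selenium-wire package, which does support authenticated proxies. A sketch of how the loop body above could use it — selenium-wire is an assumption, not part of the original code:
Plain Text
# pip install selenium-wire
from seleniumwire import webdriver as wire_webdriver

# Inside the rotation loop, replacing the plain webdriver.Chrome(...) call:
proxy = await proxy_config.new_url()  # e.g. http://user:pass@proxy.apify.com:8000
seleniumwire_options = {
    'proxy': {
        'http': proxy,
        'https': proxy,
    },
}
with wire_webdriver.Chrome(
    options=chrome_options,
    seleniumwire_options=seleniumwire_options,
) as driver:
    driver.get(url)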
2 comments
h
Hi all, what is your experience with RESIDENTIAL proxies?

Let us share:
  • provider URL
  • price /GB residential traffic
  • their advantages/disadvantages
My experience:
iproyal.com, "royal-residential-proxies": $5.51 per GB with the "Pay As You Go" option; I paid $66.15 for 12 GB.

These are good proxies, everything works.
But expensive.
Recently I've been seeing that the gigabytes I bought are running out too fast.
1 comment
n
Hi, I recently signed up to Apify for the data I required, and it was brilliant last week. I have the $39 package and I was able to scrape all 65,000 locations I needed, in a short period of time and with all the info required. But due to the new pricing policy, this same data will cost me over $800! I understand an increase, and I'm all for that, but this is huge!! Is there no middle ground? I'm happy to wait for the data to scrape; I don't need it in seconds or even minutes. It took overnight to gather the 65,000 results, and that was perfectly acceptable.
Reclaiming failed request back to the list or queue. Resource http://www.etmoc.com/look/Looklist?Id=47463 served with unsupported charset/encoding: ISO-88509-1
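The charset in that error (ISO-88509-1) looks like a typo on the site's side for ISO-8859-1, which would explain why Crawlee's encoding detection gives up. Crawlee's HTTP-based crawlers let you override detection; a sketch, assuming the intended charset really is ISO-8859-1:
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The server declares a bogus charset, so skip detection and
    // decode every response as ISO-8859-1.
    forceResponseEncoding: 'iso-8859-1',
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});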
4 comments
a
i
Hi,

How to fix this?

Failed to compile

./node_modules/.pnpm/@crawlee+puppeteer@3.13.0_playwright@1.50.1/node_modules/@crawlee/puppeteer/internals/utils/puppeteer_utils.js:224:22
Module not found: Can't resolve 'puppeteer/package.json'

  222 |     return client.send(command, ...args);
  223 | }
> 224 | const jsonPath = require.resolve('puppeteer/package.json');
      |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  225 | const parsed = JSON.parse(await (0, promises_1.readFile)(jsonPath, 'utf-8'));
  226 | throw new Error(`Cannot detect CDP client for Puppeteer ${parsed.version}. You should report this to Crawlee, mentioning the puppeteer version you are using.`);
  227 | }

https://nextjs.org/docs/messages/module-not-found
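This usually means the Next.js build is trying to bundle @crawlee/puppeteer with webpack, where require.resolve('puppeteer/package.json') cannot work. Two things that commonly help: make sure puppeteer is installed as a direct dependency, and keep crawlee out of the bundle via Next's external-packages option. A sketch — the option is experimental.serverComponentsExternalPackages on Next 13/14; newer versions renamed it serverExternalPackages:
Plain Text
// next.config.js
module.exports = {
    experimental: {
        // Keep these packages out of the webpack bundle; they are
        // loaded at runtime on the server instead.
        serverComponentsExternalPackages: ['crawlee', '@crawlee/puppeteer', 'puppeteer'],
    },
};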
3 comments
M
a
But the Apify API sometimes doesn't work correctly;
I mean, sometimes the result is zero.

What is the reason?
My guess is rate limiting, but I'm not sure.

Please help me out, thank you
1 comment
R
Anyone know how to fix this error? It works perfectly locally, but not on the Apify platform. See the attached file with the full error log.

Code snippet:
Plain Text
import { PlaywrightCrawler, Dataset } from 'crawlee';
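// playwrightCookies and waitForSelector are defined earlier in the Actor's
// input handling; this snippet is an excerpt.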

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request, log }) => {
            if (playwrightCookies.length > 0) {
                log.info(`Setting ${playwrightCookies.length} cookies for ${request.url}...`);
                await page.context().addCookies(playwrightCookies);
            }
        },
    ],
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}...`);

        const startTime = Date.now();

        if (waitForSelector) {
            try {
                await page.waitForSelector(waitForSelector, { timeout: 60000 });
            } catch (error) {
                log.info(`Selector "${waitForSelector}" not detected after 1 minute. Continuing...`);
            }
        }
1 comment
L
Hey guys! I have a problem where, when I add a new input field to an Actor, the tasks associated with that Actor don't get updated with that field.

So does that mean I have to recreate the tasks from scratch?? :((
2 comments
V
A
I'm using the Run Actor API with webhooks on the Actor run. I get the eventType ACTOR.RUN.SUCCEEDED, and that's okay, but I'd like to know what the response will look like when the eventType is ACTOR.RUN.FAILED. Can someone help me with this?
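As far as I know from the Apify webhook docs, the default payload has the same shape for both events; only eventType and the run's status (and exit code) differ. Roughly like this, with placeholder IDs, and with resource holding the full run object (trimmed here):
Plain Text
{
    "userId": "abc123",
    "createdAt": "2024-01-01T00:00:00.000Z",
    "eventType": "ACTOR.RUN.FAILED",
    "eventData": {
        "actorId": "def456",
        "actorRunId": "ghi789"
    },
    "resource": {
        "id": "ghi789",
        "actId": "def456",
        "status": "FAILED",
        "exitCode": 1
    }
}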
X
Xeno

Chrome Path

Hello, can you let me know what the path for Chrome is?
5 comments
A
X
React JS

WARNING in ./node_modules/apify-client/dist/resource_clients/user.js
Module Warning (from ./node_modules/source-map-loader/dist/cjs.js):
Failed to parse source map from '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/user.ts' file: Error: ENOENT: no such file or directory, open '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/user.ts'

WARNING in ./node_modules/apify-client/dist/resource_clients/webhook.js
Module Warning (from ./node_modules/source-map-loader/dist/cjs.js):
Failed to parse source map from '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook.ts' file: Error: ENOENT: no such file or directory, open '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook.ts'

WARNING in ./node_modules/apify-client/dist/resource_clients/webhook_collection.js
Module Warning (from ./node_modules/source-map-loader/dist/cjs.js):
Failed to parse source map from '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook_collection.ts' file: Error: ENOENT: no such file or directory, open '/Users/apple/Desktop/insta_downloader/insta_downloader/node_modules/apify-client/src/resource_clients/webhook_collection.ts'
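These warnings come from source-map-loader looking for apify-client's original .ts sources, which the published package doesn't ship; they're harmless. With Create React App you can set GENERATE_SOURCEMAP=false, or in a custom webpack 5 config suppress them explicitly (a sketch):
Plain Text
// webpack.config.js (webpack 5) -- silence the missing-source-map noise.
module.exports = {
    // ...existing config...
    ignoreWarnings: [/Failed to parse source map/],
};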
This is probably a simple fix, but I cannot find an example of Crawlee using a simple proxy link with Playwright. If anyone has a working example or knows what is wrong in the code, I would really appreciate your help. Here is the code I have been working with:

(I wish I could copy and paste the code here, but the post would go over the character limit.)

I get the following error from the code:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jr/Desktop/Pasos_webscraping/.venv/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_CERT_AUTHORITY_INVALID at https://www.instagram.com/p/DGWPnK1S0K2/
Call log:
Any help on how to proceed would be greatly appreciated!
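net::ERR_CERT_AUTHORITY_INVALID usually means the proxy is intercepting TLS with its own certificate, so the browser has to be told to accept it. A minimal plain-Playwright sketch (the proxy server and credentials are placeholders):
Plain Text
# A minimal Playwright sketch; the proxy details are placeholders.
from playwright.async_api import async_playwright

async def fetch(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={'server': 'http://proxy.example.com:8000',
                   'username': 'user', 'password': 'pass'},
        )
        # Accept the proxy's self-signed/MITM certificate.
        context = await browser.new_context(ignore_https_errors=True)
        page = await context.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html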
1 comment
M
During my Apify scraping runs with Crawlee/Puppeteer (32 GB RAM per run), my jobs stop with "There was an uncaught exception during the run of the Actor and it was not handled", plus the logs you see in the screenshot at the end.
This often happens on runs longer than 30 minutes; runs under 30 minutes are less likely to hit the error.
I've tried increasing the 'protocolTimeout' setting, but the error still happens, just after a longer wait.
I've also tried different concurrency settings, even leaving the defaults, but I consistently see this error.

Plain Text
const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                "--no-sandbox", // Mitigates the "sandboxed" process issue in Docker containers,
                "--ignore-certificate-errors",
                "--disable-dev-shm-usage",
                "--disable-infobars",
                "--disable-extensions",
                "--disable-setuid-sandbox",
                "--ignore-certificate-errors",
                "--disable-gpu", // Mitigates the "crashing GPU process" issue in Docker containers
            ],
        },
    },
    maxRequestRetries: 1,
    navigationTimeoutSecs: 60,
    autoscaledPoolOptions: { minConcurrency: 30 },
    maxSessionRotations: 5,
    preNavigationHooks: [
        async ({ blockRequests }, goToOptions) => {
            if (goToOptions) goToOptions.waitUntil = "domcontentloaded"; // Set waitUntil here
            await blockRequests({
                urlPatterns: [
...
                ],
            });
        },
    ],
    proxyConfiguration,
    requestHandler: router,
});
await crawler.run(startUrls);
await Actor.exit();
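One thing that stands out in the snippet: autoscaledPoolOptions: { minConcurrency: 30 } forces at least 30 concurrent pages regardless of memory pressure, and the Crawlee docs warn that setting minConcurrency too high for the available memory can make runs slow or crash — which would fit the long-run failures. A sketch of a gentler configuration:
Plain Text
const crawler = new PuppeteerCrawler({
    // Start low and let the autoscaler grow to a hard ceiling
    // instead of forcing 30 concurrent pages from the start.
    autoscaledPoolOptions: { minConcurrency: 5, maxConcurrency: 30 },
    // ...rest of the options as above...
});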
1 comment
O
Hi! Error with Lodash in Crawlee

Please help. I ran the actor and got this error. I tried changing to different versions of Crawlee, but the error still persists.

node:internal/modules/cjs/loader:1140
const err = new Error(message);
^

Error: Cannot find module './_baseGet'
Require stack:

  • C:\wedat\dat-spain\apps\actor\node_modules\lodash\get.js
  • C:\wedat\dat-spain\apps\actor\node_modules\@sapphire\shapeshift\dist\cjs\index.cjs
  • C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\memory-storage.js
  • C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\index.js
  • C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\core\configuration.js
4 comments
A
O
Hi!

I'm new to Crawlee, I'm super excited to migrate my scraping architecture to Crawlee but I can't find how to achieve this.

My use case:
I'm scraping 100 websites multiple times a day. I'd like to save the working configurations (cookies, headers, proxy) for each site.

From what I understand, Sessions are made for this.
However, I'd like to have the working Sessions in my database: that way, working sessions persist even if the script shuts down...

Also, saving the working configurations in a database would be useful when scaling Crawlee to multiple server instances.

My ideal scenario would be to save all the configurations for each site (including the type of crawler used (Cheerio, got, Playwright), CSS selectors, proxy needs, headers, cookies...)

Thanks a lot for your help!
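As far as I know, Crawlee doesn't ship a database-backed session store out of the box, but you can approximate one with hooks: load known-good cookies for a site from your database before navigation and write them back after a successful request. A rough sketch — loadConfig and saveConfig are hypothetical helpers backed by your database:
Plain Text
import { CheerioCrawler } from 'crawlee';

// loadConfig/saveConfig are hypothetical helpers backed by your database.
const crawler = new CheerioCrawler({
    useSessionPool: true,
    preNavigationHooks: [
        async ({ request, session }) => {
            // Restore a previously working configuration for this site.
            const config = await loadConfig(new URL(request.url).hostname);
            if (config?.cookies) session.setCookies(config.cookies, request.url);
        },
    ],
    async requestHandler({ request, session, $ }) {
        // On success, persist whatever worked for this site.
        await saveConfig(new URL(request.url).hostname, {
            cookies: session.getCookies(request.url),
        });
    },
});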
3 comments
O
F
Using our own developed PPE actors causes us to appear as paid users on the analytics dashboard. However, using our own PPR and rented actors does not reflect as a paying user. This issue with PPE actors can be confusing for developers, and since there is no actual profit/cost change, it may appear as if the actor has issues with charging.

Additionally, having more detailed indicators for PPE actors in the analytics dashboard would be very beneficial. For example, it would be great to see how much each event is charged per execution for each actor.
Hi, we are trying to upgrade to a paid solution and we can't get the payment through. We checked the billing details and contacted the card company, and there were no issues on their end. They said there was no payment attempt from Apify. Can you please assist with this issue?
14 comments
O
c
A
I am running a Twitter Scraper Actor v2 on Apify, and I see that my run succeeded and says 100 results,
but when I go to the details page, it is just an array of 100 items of {'demo': true}.
How can I get proper details?
1 comment
O
❗ Guys, was something recently released or changed at Apify related to Actor resources, etc.? I have an Actor that has been running fine for a while, but in the past few days migrations have become frequent, causing issues for some of my paid Actor users. ⚠️
1 comment
O
A parameter name containing a dot (.) with the stringList editor doesn't work in the web console.

Example INPUT_SCHEMA.JSON
Plain Text
{
    "title": "Test",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "search.location": {"title": "Locations #1", "type": "array", "description": "", "editor": "stringList", "prefill": ["Bandung"]},  ### <-- Problem
        "search_location": {"title": "Locations #2", "type": "array", "description": "", "editor": "stringList", "prefill": ["Bandung"]}
    }
}

Check Actor ID: acfF0psV9y4e9Z4hq.
I can't click the +Add button. When edited using the Bulk button, the resulting JSON is weird: it automatically becomes an object structure, which is a nice effect. Not sure if this is really a bug or a new feature?
2 comments
O
!
I want an Apify Actor that takes a location name as input and returns the LinkedIn geolocation ID as output. Is there any such Actor available in the Apify Store, or on any platform in general?
2 comments
O
k