
use_persistent_context or use_managed_browser causes the browser to hang forever #430

Open
berkaygkv opened this issue Jan 8, 2025 · 5 comments

Comments

@berkaygkv

berkaygkv commented Jan 8, 2025

It's been a couple of days since I started using this library; awesome work, thanks. I wanted to work with a consistent browser context where all my login history persists across runs. To this end, I implemented the following script:

import os
from pathlib import Path
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        # use_managed_browser=True, 
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        delay_before_return_html=125,
        session_id="12312",
        magic=True,
        adjust_viewport_to_content=True,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        
        result = await crawler.arun(
            url,
            config=run_config,
            #magic=True,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())

The script opens a functional browser: I can navigate and interact with it, and everything is stored in the user_data_dir I gave it. In short, the browser configuration works perfectly. However, the script gets stuck before reaching the arun method and never proceeds to the actual crawling. I don't know whether it's a bug or whether I'm using the feature incorrectly. I have searched previous issues and a couple of other examples, but no luck. Any help is appreciated.
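
For reference, one way to confirm that the hang happens inside arun itself (rather than being a very slow page load) is to wrap the call in a timeout. This is only a diagnostic sketch using the stdlib asyncio.wait_for plus the same crawl4ai classes from the script above; the 30-second limit is an arbitrary choice:

import asyncio, os
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def probe_hang():
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    browser_config = BrowserConfig(verbose=True, headless=False, user_data_dir=user_data_dir)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        try:
            # If this raises TimeoutError, arun never returned -> the reported hang,
            # not just a slow page or a long delay_before_return_html.
            result = await asyncio.wait_for(
                crawler.arun("https://httpbin.org/#/Request_inspection/get_headers", config=run_config),
                timeout=30,
            )
            print(f"OK, got {len(result.markdown)} characters")
        except asyncio.TimeoutError:
            print("arun did not return within 30 seconds")

if __name__ == "__main__":
    asyncio.run(probe_hang())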

Thank you

@Etherdrake

Etherdrake commented Jan 10, 2025

I am currently having the same problem on Linux. My IP is banned from the website I am trying to access, but I can reach it through a managed browser. When I issue a Ctrl+C, what I get is TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'.

Inside async_crawler_strategy.py I also had to change:

else:  # Linux
    paths = {
        "chromium": "/home/user/.cache/ms-playwright/chromium-1148/chrome-linux/chrome",  # Made change here pointing to Playwright binary location
        "firefox": "firefox",
        "webkit": None,  # WebKit not supported on Linux
    }

I made that change because the program would never find my Chromium installation, returning the error Could not find google-chrome, even with browser_type set to "chromium".
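
To avoid hard-coding a version-specific build directory such as chromium-1148, Playwright itself can report where its managed Chromium binary lives. This sketch uses Playwright's own executable_path property and is independent of crawl4ai:

from playwright.sync_api import sync_playwright

# Prints the Chromium binary that Playwright manages, e.g.
# ~/.cache/ms-playwright/chromium-<build>/chrome-linux/chrome on Linux.
with sync_playwright() as p:
    print(p.chromium.executable_path)

That output can then be used for the "chromium" entry in the paths dict instead of guessing the build number.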

This is my config:

browser_config = BrowserConfig(
    verbose=True,
    headless=False,
    use_managed_browser=True,
    browser_type="chromium",
    user_data_dir="/home/user/chrome_dir",
    use_persistent_context=True,
)

# Set up the crawler config
cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Bypass cache for fresh scraping
    extraction_strategy=extraction_strategy,
    magic=False,
    # remove_overlay_elements=True,
    # page_timeout=60000
)

When I do not run headless, I just get an idle browser window that never even navigates to the URL I specified. The issue seems to stem from the run config not being passed to the managed browser properly; likewise, any help is appreciated.
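
For reference, this is how the two configs above would presumably be wired together, following the pattern from the first comment in this thread. The URL is a placeholder and extraction_strategy is left out; it is a sketch of the intended usage, not a confirmed fix:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        use_managed_browser=True,
        browser_type="chromium",
        user_data_dir="/home/user/chrome_dir",
        use_persistent_context=True,
    )
    cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, magic=False)

    # The browser config is passed to the crawler; the run config goes to arun().
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=cfg)
        print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())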

@unclecode
Owner

@berkaygkv Thanks for trying the library and for your kind words. While I check your code, I noticed that you set delay_before_return_html=125, which means you want roughly a two-minute delay before returning the HTML. Is that correct? Is that your intention? I will review your code and let you know what's going on.

@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx

@berkaygkv
Author

berkaygkv commented Jan 10, 2025

@unclecode yeah, it's just a dumb way to debug the behavior. I realized the browser closes automatically even though I put a breakpoint at the line print(f"Successfully crawled {url}"), so I came up with this dumb delay workaround.
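
As an aside, a cleaner way to keep the browser around for manual inspection than inflating delay_before_return_html is to pause after arun while the async with block is still open. Whether the managed browser really stays alive for the lifetime of that block is an assumption here; the sketch only illustrates the pattern:

import asyncio, os
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def inspect_session():
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    browser_config = BrowserConfig(headless=False, user_data_dir=user_data_dir)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://httpbin.org/#/Request_inspection/get_headers", config=run_config)
        print(f"Crawl finished with {len(result.markdown)} characters; browser still open")
        # Block here instead of delaying inside the crawl itself; pressing Enter
        # lets the context manager exit and close the browser normally.
        await asyncio.to_thread(input, "Press Enter to close the browser... ")

if __name__ == "__main__":
    asyncio.run(inspect_session())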

Just to note, I checked the new documentation you released yesterday (it's quite comprehensive) and followed the steps described in the identity-based management section, but I get the same behavior.

Lastly, I can confirm @Etherdrake's observation: upon interrupting the code with Ctrl+C, the interpreter throws the following:

TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'

Though I don't know if it's related to the behavior we're discussing.
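
One way to see exactly where the script is stuck, beyond the partial traceback that Ctrl+C produces, is Python's stdlib faulthandler, which can dump every thread's stack on demand. This is a generic diagnostic sketch and has nothing to do with crawl4ai's own API:

import faulthandler, signal

# Put this near the top of the hanging script (signal-based dumps are Unix-only).
# While the script appears stuck, run `kill -USR1 <pid>` from another terminal
# to print all thread stacks to stderr without terminating the process.
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump the stacks automatically if the script is still running after 60s.
faulthandler.dump_traceback_later(60, exit=False)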

@unclecode
Owner

I was going to ask you to check the new docs while I look into this for you. Ok, no worries, tomorrow I'll get it done for you. @berkaygkv

@berkaygkv
Author

Appreciate your time and effort. I really admire your work.
