
use_persistent_context or use_managed_browser causes the browser to hang forever #430

Open
berkaygkv opened this issue Jan 8, 2025 · 5 comments

Comments

@berkaygkv

berkaygkv commented Jan 8, 2025

It's been a couple of days since I started using this library; awesome work, thanks. I wanted to work with a consistent browser context where all my login history persists across runs. To this end, I implemented the following script:

import os
from pathlib import Path
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    print(user_data_dir)
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        user_data_dir=user_data_dir,
        # use_managed_browser=True, 
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        delay_before_return_html=125,
        session_id="12312",
        magic=True,
        adjust_viewport_to_content=True,
    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "https://httpbin.org/#/Request_inspection/get_headers"
        
        result = await crawler.arun(
            url,
            config=run_config,
            #magic=True,
        )
        
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())

The script opens a functional browser: I can navigate and interact with it, and everything is stored in the user_data_dir I gave it. In short, the browser configuration works perfectly. However, the script gets stuck before reaching the arun method and never proceeds to the actual crawling. I don't know whether it's a bug or whether I'm using the feature incorrectly. I have searched previous issues and a couple of other examples, but no luck. Any help is appreciated.
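
For reference, one way to confirm that the hang happens inside arun itself (rather than being a very slow page load) is to wrap the call in a timeout. This is only a diagnostic sketch using the stdlib asyncio.wait_for plus the same crawl4ai classes from the script above; the 30-second limit is an arbitrary choice:

import asyncio, os
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def probe_hang():
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    browser_config = BrowserConfig(verbose=True, headless=False, user_data_dir=user_data_dir)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        try:
            # If this raises TimeoutError, arun never returned -> the reported hang,
            # not just a slow page or a long delay_before_return_html.
            result = await asyncio.wait_for(
                crawler.arun("https://httpbin.org/#/Request_inspection/get_headers", config=run_config),
                timeout=30,
            )
            print(f"OK, got {len(result.markdown)} characters")
        except asyncio.TimeoutError:
            print("arun did not return within 30 seconds")

if __name__ == "__main__":
    asyncio.run(probe_hang())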

Thank you

@Etherdrake

Etherdrake commented Jan 10, 2025

I am currently having the same problem on Linux. My IP is banned from the website I am trying to access, but I can reach it through a managed browser. When I issue a Ctrl+C, what I get is TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'.

Inside async_crawler_strategy.py I also had to change:

else:  # Linux
    paths = {
        "chromium": "/home/user/.cache/ms-playwright/chromium-1148/chrome-linux/chrome",  # Made change here pointing to Playwright binary location
        "firefox": "firefox",
        "webkit": None,  # WebKit not supported on Linux
    }

I made that change because the program would never find my Chromium installation, returning the error Could not find google-chrome, even with browser_type set to "chromium".
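
To avoid hard-coding a version-specific build directory such as chromium-1148, Playwright itself can report where its managed Chromium binary lives. This sketch uses Playwright's own executable_path property and is independent of crawl4ai:

from playwright.sync_api import sync_playwright

# Prints the Chromium binary that Playwright manages, e.g.
# ~/.cache/ms-playwright/chromium-<build>/chrome-linux/chrome on Linux.
with sync_playwright() as p:
    print(p.chromium.executable_path)

That output can then be used for the "chromium" entry in the paths dict instead of guessing the build number.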

This is my config:

browser_config = BrowserConfig(
    verbose=True,
    headless=False,
    use_managed_browser=True,
    browser_type="chromium",
    user_data_dir="/home/user/chrome_dir",
    use_persistent_context=True,
)

# Set up the crawler config
cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # Bypass cache for fresh scraping
    extraction_strategy=extraction_strategy,
    magic=False,
    # remove_overlay_elements=True,
    # page_timeout=60000
)

When I do not run headless, I just get an idle browser window that never even navigates to the URL I specified. The issue seems to stem from the run config not being passed to the managed browser properly; likewise, any help is appreciated.
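
For reference, this is how the two configs above would presumably be wired together, following the pattern from the first comment in this thread. The URL is a placeholder and extraction_strategy is left out; it is a sketch of the intended usage, not a confirmed fix:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(
        verbose=True,
        headless=False,
        use_managed_browser=True,
        browser_type="chromium",
        user_data_dir="/home/user/chrome_dir",
        use_persistent_context=True,
    )
    cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, magic=False)

    # The browser config is passed to the crawler; the run config goes to arun().
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=cfg)
        print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())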

@unclecode
Owner

@berkaygkv Thanks for trying the library and for your kind words. While I check your code, I noticed that you set delay_before_return_html=125, which means you want roughly a two-minute delay before returning the HTML. Is that correct? Is that your intention? I will review your code and let you know what's going on.

@Etherdrake Would you please share the complete code snippet showing how you configure and run the crawler? Thx

@berkaygkv
Author

berkaygkv commented Jan 10, 2025

@unclecode yeah, it's just a dumb way to debug the behavior. I realized the browser closes automatically even though I put a breakpoint at the line print(f"Successfully crawled {url}"), so I came up with this dumb delay workaround.
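
As an aside, a cleaner way to keep the browser around for manual inspection than inflating delay_before_return_html is to pause after arun while the async with block is still open. Whether the managed browser really stays alive for the lifetime of that block is an assumption here; the sketch only illustrates the pattern:

import asyncio, os
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def inspect_session():
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    browser_config = BrowserConfig(headless=False, user_data_dir=user_data_dir)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://httpbin.org/#/Request_inspection/get_headers", config=run_config)
        print(f"Crawl finished with {len(result.markdown)} characters; browser still open")
        # Block here instead of delaying inside the crawl itself; pressing Enter
        # lets the context manager exit and close the browser normally.
        await asyncio.to_thread(input, "Press Enter to close the browser... ")

if __name__ == "__main__":
    asyncio.run(inspect_session())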

Just to note, I checked the new documentation you released yesterday (it's quite comprehensive) and followed the steps described in the identity-based management section, but I get the same behavior.

Lastly, I can confirm @Etherdrake's observation: upon interrupting the code with Ctrl+C, the interpreter throws the following:

TypeError: BrowserManager.setup_context() missing 1 required positional argument: 'crawlerRunConfig'

Though I don't know if it's related to the behavior we're discussing.
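
One way to see exactly where the script is stuck, beyond the partial traceback that Ctrl+C produces, is Python's stdlib faulthandler, which can dump every thread's stack on demand. This is a generic diagnostic sketch and has nothing to do with crawl4ai's own API:

import faulthandler, signal

# Put this near the top of the hanging script (signal-based dumps are Unix-only).
# While the script appears stuck, run `kill -USR1 <pid>` from another terminal
# to print all thread stacks to stderr without terminating the process.
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump the stacks automatically if the script is still running after 60s.
faulthandler.dump_traceback_later(60, exit=False)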

@unclecode
Owner

I was going to ask you to check the new docs while I look into this for you. Ok, no worries, tomorrow I'll get it done for you. @berkaygkv

@berkaygkv
Author

Appreciate your time and effort. I really admire your work.
