From 8dd48a08e4e5dfab80b1d715fb88c4df946fd797 Mon Sep 17 00:00:00 2001
From: Andrey Rakhmatullin
Date: Wed, 13 Sep 2023 20:46:32 +0400
Subject: [PATCH 1/5] Move PeriodicLog docs from Debugging to General purpose.

---
 docs/topics/extensions.rst | 110 +++++++++++++++++++------------------
 1 file changed, 57 insertions(+), 53 deletions(-)

diff --git a/docs/topics/extensions.rst b/docs/topics/extensions.rst
index 0286581c025..f7b2f37990e 100644
--- a/docs/topics/extensions.rst
+++ b/docs/topics/extensions.rst
@@ -350,52 +350,8 @@ full list of parameters, including examples on how to instantiate
 .. module:: scrapy.extensions.debug
    :synopsis: Extensions for debugging Scrapy
 
-Debugging extensions
---------------------
-
-Stack trace dump extension
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. class:: StackTraceDump
-
-Dumps information about the running process when a `SIGQUIT`_ or `SIGUSR2`_
-signal is received. The information dumped is the following:
-
-1. engine status (using ``scrapy.utils.engine.get_engine_status()``)
-2. live references (see :ref:`topics-leaks-trackrefs`)
-3. stack trace of all threads
-
-After the stack trace and engine status is dumped, the Scrapy process continues
-running normally.
-
-This extension only works on POSIX-compliant platforms (i.e. not Windows),
-because the `SIGQUIT`_ and `SIGUSR2`_ signals are not available on Windows.
-
-There are at least two ways to send Scrapy the `SIGQUIT`_ signal:
-
-1. By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
-2. By running this command (assuming ``<pid>`` is the process id of the Scrapy
-   process)::
-
-    kill -QUIT <pid>
-
-.. _SIGUSR2: https://en.wikipedia.org/wiki/SIGUSR1_and_SIGUSR2
-.. _SIGQUIT: https://en.wikipedia.org/wiki/SIGQUIT
-
-Debugger extension
-~~~~~~~~~~~~~~~~~~
-
-.. class:: Debugger
-
-Invokes a :doc:`Python debugger <library/pdb>` inside a running Scrapy process when a `SIGUSR2`_
-signal is received. After the debugger is exited, the Scrapy process continues
-running normally.
-
-For more info see `Debugging in Python`_.
-
-This extension only works on POSIX-compliant platforms (i.e. not Windows).
-
-.. _Debugging in Python: https://pythonconquerstheuniverse.wordpress.com/2009/09/10/debugging-in-python/
+.. module:: scrapy.extensions.periodic_log
+   :synopsis: Periodic stats logging
 
 Periodic log extension
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -441,10 +397,10 @@ This extension periodically logs rich stat data as a JSON object::
 
 This extension logs the following configurable sections:
 
-- ``"delta"`` shows how some numeric stats have changed since the last stats 
+- ``"delta"`` shows how some numeric stats have changed since the last stats
   log message.
- 
-  The :setting:`PERIODIC_LOG_DELTA` setting determines the target stats. They 
+
+  The :setting:`PERIODIC_LOG_DELTA` setting determines the target stats. They
   must have ``int`` or ``float`` values.
 
 - ``"stats"`` shows the current value of some stats.
@@ -453,11 +409,11 @@ This extension logs the following configurable sections:
 
 - ``"time"`` shows detailed timing data.
 
-  The :setting:`PERIODIC_LOG_TIMING_ENABLED` setting determines whether or 
+  The :setting:`PERIODIC_LOG_TIMING_ENABLED` setting determines whether or
   not to show this section.
 
-This extension logs data at the start, then on a fixed time interval 
-configurable through the :setting:`LOGSTATS_INTERVAL` setting, and finally 
+This extension logs data at the start, then on a fixed time interval
+configurable through the :setting:`LOGSTATS_INTERVAL` setting, and finally
 right before the crawl ends.
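For reference, a minimal sketch of enabling the extension documented above from a
project's ``settings.py`` could look like this; the option values are illustrative
and are not part of this patch::

    # settings.py (illustrative values) -- enable periodic stat logging
    EXTENSIONS = {
        "scrapy.extensions.periodic_log.PeriodicLog": 0,
    }
    PERIODIC_LOG_STATS = True  # log current values of all numeric stats
    PERIODIC_LOG_DELTA = True  # log per-interval deltas of numeric stats
    PERIODIC_LOG_TIMING_ENABLED = True  # include the "time" section
    LOGSTATS_INTERVAL = 60.0  # seconds between periodic log messages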
@@ -507,4 +463,52 @@ PERIODIC_LOG_TIMING_ENABLED
 
 Default: ``False``
 
-``True`` enables logging of timing data (i.e. the ``"time"`` section).
\ No newline at end of file
+``True`` enables logging of timing data (i.e. the ``"time"`` section).
+
+
+Debugging extensions
+--------------------
+
+Stack trace dump extension
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. class:: StackTraceDump
+
+Dumps information about the running process when a `SIGQUIT`_ or `SIGUSR2`_
+signal is received. The information dumped is the following:
+
+1. engine status (using ``scrapy.utils.engine.get_engine_status()``)
+2. live references (see :ref:`topics-leaks-trackrefs`)
+3. stack trace of all threads
+
+After the stack trace and engine status is dumped, the Scrapy process continues
+running normally.
+
+This extension only works on POSIX-compliant platforms (i.e. not Windows),
+because the `SIGQUIT`_ and `SIGUSR2`_ signals are not available on Windows.
+
+There are at least two ways to send Scrapy the `SIGQUIT`_ signal:
+
+1. By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
+2. By running this command (assuming ``<pid>`` is the process id of the Scrapy
+   process)::
+
+    kill -QUIT <pid>
+
+.. _SIGUSR2: https://en.wikipedia.org/wiki/SIGUSR1_and_SIGUSR2
+.. _SIGQUIT: https://en.wikipedia.org/wiki/SIGQUIT
+
+Debugger extension
+~~~~~~~~~~~~~~~~~~
+
+.. class:: Debugger
+
+Invokes a :doc:`Python debugger <library/pdb>` inside a running Scrapy process when a `SIGUSR2`_
+signal is received. After the debugger is exited, the Scrapy process continues
+running normally.
+
+For more info see `Debugging in Python`_.
+
+This extension only works on POSIX-compliant platforms (i.e. not Windows).
+
+.. _Debugging in Python: https://pythonconquerstheuniverse.wordpress.com/2009/09/10/debugging-in-python/

From f96a3ed5f0e5dcc2bf6849219248c3da2fb85c72 Mon Sep 17 00:00:00 2001
From: Andrey Rakhmatullin
Date: Wed, 13 Sep 2023 20:46:55 +0400
Subject: [PATCH 2/5] Cover up to cddb8c15d in the release notes.

---
 docs/news.rst | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/docs/news.rst b/docs/news.rst
index 9e758f05d49..2237697c766 100644
--- a/docs/news.rst
+++ b/docs/news.rst
@@ -8,6 +8,13 @@ Release notes
 Scrapy 2.11.0 (to be released)
 ------------------------------
 
+Highlights:
+
+-
+
+- Periodic stats logging.
+
+
 Backward-incompatible changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -20,6 +27,50 @@ Backward-incompatible changes
   (:issue:`5968`)
 
 
+Deprecation removals
+~~~~~~~~~~~~~~~~~~~~
+
+- Removed the binary export mode of
+  :class:`~scrapy.exporters.PythonItemExporter`, deprecated in Scrapy 1.1.0.
+  (:issue:`6006`, :issue:`6007`)
+
+- Removed the ``CrawlerRunner.spiders`` attribute, deprecated in Scrapy
+  1.0.0, use :attr:`CrawlerRunner.spider_loader
+  <scrapy.crawler.CrawlerRunner.spider_loader>` instead. (:issue:`6010`)
+
+New features
+~~~~~~~~~~~~
+
+- Added the :class:`~scrapy.extensions.periodic_log.PeriodicLog` extension
+  which can be enabled to log stats and/or their differences periodically.
+  (:issue:`5926`)
+
+- Links to ``.webp`` files are now ignored by :ref:`link extractors
+  <topics-link-extractors>`. (:issue:`6021`)
+
+Bug fixes
+~~~~~~~~~
+
+- :meth:`scrapy.settings.BaseSettings.getdictorlist`, used to parse
+  :setting:`FEED_EXPORT_FIELDS`, now handles tuple values. (:issue:`6011`,
+  :issue:`6013`)
+
+- Calls to ``datetime.utcnow()``, no longer recommended to be used, have been
+  replaced with calls to ``datetime.now()`` with a timezone. (:issue:`6014`)
+
+Documentation
+~~~~~~~~~~~~~
+
+- Updated a deprecated function call in a pipeline example.
+  (:issue:`6008`, :issue:`6009`)
+
+Quality assurance
+~~~~~~~~~~~~~~~~~
+
+- Extended typing hints. (:issue:`6003`, :issue:`6005`)
+
+- Other CI and pre-commit improvements. (:issue:`6002`, :issue:`6013`)
+
 .. _release-2.10.1:
 
 Scrapy 2.10.1 (2023-08-30)

From c2346b4a95e51ec1d3a255e19b49042b4598a02d Mon Sep 17 00:00:00 2001
From: Andrey Rakhmatullin
Date: Fri, 15 Sep 2023 19:15:05 +0400
Subject: [PATCH 3/5] Update the release notes up to current master.

---
 docs/news.rst | 72 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 63 insertions(+), 9 deletions(-)

diff --git a/docs/news.rst b/docs/news.rst
index 2237697c766..7e26299c70c 100644
--- a/docs/news.rst
+++ b/docs/news.rst
@@ -10,7 +10,9 @@ Scrapy 2.11.0 (to be released)
 
 Highlights:
 
--
+- Spiders can now modify :ref:`settings <topics-settings>` in their
+  :meth:`~scrapy.Spider.from_crawler` methods, e.g. based on :ref:`spider
+  arguments <spiderargs>`.
 
 - Periodic stats logging.
 
@@ -18,14 +20,25 @@ Highlights:
 Backward-incompatible changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+- Most of the initialization of :class:`scrapy.crawler.Crawler` instances is
+  now done in :meth:`~scrapy.crawler.Crawler.crawl`, so the state of
+  instances before that method is called is now different compared to older
+  Scrapy versions. We do not recommend using the
+  :class:`~scrapy.crawler.Crawler` instances before
+  :meth:`~scrapy.crawler.Crawler.crawl` is called. (:issue:`6038`)
+
+- :meth:`scrapy.Spider.from_crawler` is now called before the initialization
+  of various components previously initialized in
+  :meth:`scrapy.crawler.Crawler.__init__` and before the settings are
+  finalized and frozen. This change was needed to allow changing the settings
+  in :meth:`scrapy.Spider.from_crawler`. If you want to access the final
+  setting values in the spider code as early as possible you can do this in
+  :meth:`~scrapy.Spider.start_requests`. (:issue:`6038`)
+
 - The :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method now
   requires the response to be in a valid JSON encoding (UTF-8, UTF-16, or
-  UTF-32).
-
-  If you need to deal with JSON documents in an invalid encoding, use
-  ``json.loads(response.text)`` instead.
-
-  (:issue:`5968`)
+  UTF-32). If you need to deal with JSON documents in an invalid encoding,
+  use ``json.loads(response.text)`` instead. (:issue:`6016`)
 
 Deprecation removals
 ~~~~~~~~~~~~~~~~~~~~
@@ -38,19 +51,55 @@ Deprecation removals
   1.0.0, use :attr:`CrawlerRunner.spider_loader
   <scrapy.crawler.CrawlerRunner.spider_loader>` instead. (:issue:`6010`)
 
+  .. note:: If you are using this Scrapy version on Scrapy Cloud with a stack
+            that includes an older Scrapy version and get a "TypeError:
+            Unexpected options: binary" error, you may need to add
+            ``scrapinghub-entrypoint-scrapy > 0.14.0`` to your project
+            requirements or switch to a stack that includes Scrapy 2.11.
+
+Deprecations
+~~~~~~~~~~~~
+
+- Running :meth:`~scrapy.crawler.Crawler.crawl` more than once on the same
+  :class:`scrapy.crawler.Crawler` instance is now deprecated. (:issue:`1587`,
+  :issue:`6040`)
+
 New features
 ~~~~~~~~~~~~
 
+- Changed the :class:`scrapy.crawler.Crawler` initialization order, so that
+  most of the initialization that previously happened in
+  :meth:`~scrapy.crawler.Crawler.__init__` now happens in
+  :meth:`~scrapy.crawler.Crawler.crawl` after the spider instance is created.
+  This allows spider instances to modify settings in their
+  :meth:`~scrapy.Spider.from_crawler` methods, e.g. based on :ref:`spider
+  arguments <spiderargs>`.
+  (:issue:`1305`, :issue:`1580`, :issue:`2392`,
+  :issue:`3663`, :issue:`6038`)
+
 - Added the :class:`~scrapy.extensions.periodic_log.PeriodicLog` extension
   which can be enabled to log stats and/or their differences periodically.
   (:issue:`5926`)
 
+- Optimized the memory usage in :meth:`TextResponse.json
+  <scrapy.http.TextResponse.json>` by removing unnecessary body decoding.
+  (:issue:`5968`, :issue:`6016`)
+
 - Links to ``.webp`` files are now ignored by :ref:`link extractors
   <topics-link-extractors>`. (:issue:`6021`)
 
 Bug fixes
 ~~~~~~~~~
 
+- Fixed logging enabled add-ons. (:issue:`6036`)
+
+- Fixed :class:`~scrapy.mail.MailSender` producing invalid message bodies
+  when the ``charset`` argument is passed to
+  :meth:`~scrapy.mail.MailSender.send`. (:issue:`5096`, :issue:`5118`)
+
+- Fixed an exception when accessing ``self.EXCEPTIONS_TO_RETRY`` from a
+  subclass of :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware`.
+  (:issue:`6049`, :issue:`6050`)
+
 - :meth:`scrapy.settings.BaseSettings.getdictorlist`, used to parse
   :setting:`FEED_EXPORT_FIELDS`, now handles tuple values. (:issue:`6011`,
   :issue:`6013`)
@@ -67,9 +116,14 @@ Documentation
 
 Quality assurance
 ~~~~~~~~~~~~~~~~~
 
-- Extended typing hints. (:issue:`6003`, :issue:`6005`)
+- Extended typing hints. (:issue:`6003`, :issue:`6005`, :issue:`6031`,
+  :issue:`6034`)
+
+- Pinned brotli_ to 1.0.9 for the PyPy tests as 1.1.0 breaks them.
+  (:issue:`6044`, :issue:`6045`)
 
-- Other CI and pre-commit improvements. (:issue:`6002`, :issue:`6013`)
+- Other CI and pre-commit improvements. (:issue:`6002`, :issue:`6013`,
+  :issue:`6046`)
 
 .. _release-2.10.1:

From 2fa768399a27aca615bccfc7c466758a968f10fe Mon Sep 17 00:00:00 2001
From: Andrey Rakhmatullin
Date: Fri, 15 Sep 2023 19:19:42 +0400
Subject: [PATCH 4/5] Replace the VERSION vars.

---
 docs/topics/settings.rst | 2 +-
 docs/topics/spiders.rst  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/topics/settings.rst b/docs/topics/settings.rst
index 3006fb8b14d..7cdfb8768c9 100644
--- a/docs/topics/settings.rst
+++ b/docs/topics/settings.rst
@@ -98,7 +98,7 @@ and settings set there should use the "spider" priority explicitly:
             super().update_settings(settings)
             settings.set("SOME_SETTING", "some value", priority="spider")
 
-.. versionadded:: VERSION
+.. versionadded:: 2.11
 
     It's also possible to modify the settings in the
    :meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
diff --git a/docs/topics/spiders.rst b/docs/topics/spiders.rst
index 1ca7eda7b55..20452d55895 100644
--- a/docs/topics/spiders.rst
+++ b/docs/topics/spiders.rst
@@ -136,7 +136,7 @@ scrapy.Spider
        attributes in the new instance so they can be accessed later inside the
        spider's code.
 
-       .. versionchanged:: VERSION
+       .. versionchanged:: 2.11
 
           The settings in ``crawler.settings`` can now be modified in this
          method, which is handy if you want to modify them based on

From 528911da85f871fd0f7546d4d16bbea556793d4b Mon Sep 17 00:00:00 2001
From: Andrey Rakhmatullin
Date: Mon, 18 Sep 2023 14:35:28 +0400
Subject: [PATCH 5/5] Fix/reword the release notes.

---
 docs/news.rst | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/docs/news.rst b/docs/news.rst
index 7e26299c70c..0566ff28e5d 100644
--- a/docs/news.rst
+++ b/docs/news.rst
@@ -14,7 +14,7 @@ Highlights:
   :meth:`~scrapy.Spider.from_crawler` methods, e.g. based on :ref:`spider
   arguments <spiderargs>`.
 
-- Periodic stats logging.
+- Periodic logging of stats.
 
 Backward-incompatible changes
@@ -47,16 +47,16 @@ Deprecation removals
   :class:`~scrapy.exporters.PythonItemExporter`, deprecated in Scrapy 1.1.0.
   (:issue:`6006`, :issue:`6007`)
 
-- Removed the ``CrawlerRunner.spiders`` attribute, deprecated in Scrapy
-  1.0.0, use :attr:`CrawlerRunner.spider_loader
-  <scrapy.crawler.CrawlerRunner.spider_loader>` instead. (:issue:`6010`)
-
   .. note:: If you are using this Scrapy version on Scrapy Cloud with a stack
             that includes an older Scrapy version and get a "TypeError:
             Unexpected options: binary" error, you may need to add
-            ``scrapinghub-entrypoint-scrapy > 0.14.0`` to your project
+            ``scrapinghub-entrypoint-scrapy >= 0.14.1`` to your project
             requirements or switch to a stack that includes Scrapy 2.11.
 
+- Removed the ``CrawlerRunner.spiders`` attribute, deprecated in Scrapy
+  1.0.0, use :attr:`CrawlerRunner.spider_loader
+  <scrapy.crawler.CrawlerRunner.spider_loader>` instead. (:issue:`6010`)
+
 Deprecations
 ~~~~~~~~~~~~
 
@@ -67,12 +67,8 @@ Deprecations
 New features
 ~~~~~~~~~~~~
 
-- Changed the :class:`scrapy.crawler.Crawler` initialization order, so that
-  most of the initialization that previously happened in
-  :meth:`~scrapy.crawler.Crawler.__init__` now happens in
-  :meth:`~scrapy.crawler.Crawler.crawl` after the spider instance is created.
-  This allows spider instances to modify settings in their
-  :meth:`~scrapy.Spider.from_crawler` methods, e.g. based on :ref:`spider
+- Spiders can now modify settings in their
+  :meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
   arguments <spiderargs>`. (:issue:`1305`, :issue:`1580`, :issue:`2392`,
   :issue:`3663`, :issue:`6038`)
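For reference, a minimal sketch of the behaviour described in the reworded entry
above (a spider adjusting its own settings in ``from_crawler()``, e.g. based on a
spider argument) could look like this; the spider name, argument and chosen
setting are illustrative only::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # As of Scrapy 2.11 the settings are not yet frozen at this point,
            # so they can still be modified, e.g. based on a spider argument
            # (hypothetical ``-a use_cache=1``).
            if getattr(spider, "use_cache", None):
                crawler.settings.set("HTTPCACHE_ENABLED", True, priority="spider")
            return spider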