Merge pull request scrapy#6048 from wRAR/relnotes-2.11
Release notes for 2.11.0
wRAR authored Sep 18, 2023
2 parents 3f34a5b + 528911d commit efc594b
Showing 4 changed files with 164 additions and 59 deletions.
109 changes: 105 additions & 4 deletions docs/news.rst
@@ -8,17 +8,118 @@ Release notes
Scrapy 2.11.0 (to be released)
------------------------------

Highlights:

- Spiders can now modify :ref:`settings <topics-settings>` in their
:meth:`~scrapy.Spider.from_crawler` methods, e.g. based on :ref:`spider
arguments <spiderargs>`.

- Periodic logging of stats.


Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Most of the initialization of :class:`scrapy.crawler.Crawler` instances is
  now done in :meth:`~scrapy.crawler.Crawler.crawl`, so the state of an
  instance before that method is called now differs from that in older
  Scrapy versions. We do not recommend using
  :class:`~scrapy.crawler.Crawler` instances before
  :meth:`~scrapy.crawler.Crawler.crawl` is called. (:issue:`6038`)

- :meth:`scrapy.Spider.from_crawler` is now called before the initialization
of various components previously initialized in
:meth:`scrapy.crawler.Crawler.__init__` and before the settings are
finalized and frozen. This change was needed to allow changing the settings
in :meth:`scrapy.Spider.from_crawler`. If you want to access the final
  setting values in the spider code as early as possible, you can do so in
  :meth:`~scrapy.Spider.start_requests`. (:issue:`6038`)

- The :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method now
  requires the response to be in a valid JSON encoding (UTF-8, UTF-16, or
  UTF-32). If you need to deal with JSON documents in an invalid encoding,
  use ``json.loads(response.text)`` instead. (:issue:`6016`)
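
  A minimal sketch of that workaround in a spider callback, assuming a
  hypothetical endpoint that serves Latin-1-encoded JSON::

      import json

      import scrapy


      class Latin1JsonSpider(scrapy.Spider):  # hypothetical example spider
          name = "latin1_json"
          start_urls = ["https://example.com/data.json"]

          def parse(self, response):
              # response.json() expects UTF-8/16/32-encoded JSON; for other
              # encodings, decode through response.text, which uses the
              # encoding Scrapy detected for the response.
              data = json.loads(response.text)
              yield {"item_count": len(data)}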

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

- Removed the binary export mode of
:class:`~scrapy.exporters.PythonItemExporter`, deprecated in Scrapy 1.1.0.
(:issue:`6006`, :issue:`6007`)

.. note:: If you are using this Scrapy version on Scrapy Cloud with a stack
that includes an older Scrapy version and get a "TypeError:
Unexpected options: binary" error, you may need to add
``scrapinghub-entrypoint-scrapy >= 0.14.1`` to your project
requirements or switch to a stack that includes Scrapy 2.11.

- Removed the ``CrawlerRunner.spiders`` attribute, deprecated in Scrapy
  1.0.0; use :attr:`CrawlerRunner.spider_loader
  <scrapy.crawler.CrawlerRunner.spider_loader>` instead. (:issue:`6010`)

Deprecations
~~~~~~~~~~~~

- Running :meth:`~scrapy.crawler.Crawler.crawl` more than once on the same
:class:`scrapy.crawler.Crawler` instance is now deprecated. (:issue:`1587`,
:issue:`6040`)

New features
~~~~~~~~~~~~

- Spiders can now modify settings in their
  :meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
  arguments <spiderargs>`; see the sketch after this list. (:issue:`1305`,
  :issue:`1580`, :issue:`2392`, :issue:`3663`, :issue:`6038`)

- Added the :class:`~scrapy.extensions.periodic_log.PeriodicLog` extension
which can be enabled to log stats and/or their differences periodically.
(:issue:`5926`)

- Optimized the memory usage in :meth:`TextResponse.json
<scrapy.http.TextResponse.json>` by removing unnecessary body decoding.
(:issue:`5968`, :issue:`6016`)

- Links to ``.webp`` files are now ignored by :ref:`link extractors
<topics-link-extractors>`. (:issue:`6021`)
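
A minimal sketch of the settings-modification feature mentioned above,
assuming a hypothetical ``throttle`` spider argument::

    import scrapy

    class MySpider(scrapy.Spider):
        name = "example"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # The settings are not yet frozen at this point, so they can
            # still be changed, e.g. based on a spider argument.
            if kwargs.get("throttle"):
                crawler.settings.set("DOWNLOAD_DELAY", 5.0, priority="spider")
            return spider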

Bug fixes
~~~~~~~~~

- Fixed the logging of enabled add-ons. (:issue:`6036`)

- Fixed :class:`~scrapy.mail.MailSender` producing invalid message bodies
when the ``charset`` argument is passed to
:meth:`~scrapy.mail.MailSender.send`. (:issue:`5096`, :issue:`5118`)

- Fixed an exception when accessing ``self.EXCEPTIONS_TO_RETRY`` from a
  subclass of :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware`;
  see the sketch after this list. (:issue:`6049`, :issue:`6050`)

- :meth:`scrapy.settings.BaseSettings.getdictorlist`, used to parse
:setting:`FEED_EXPORT_FIELDS`, now handles tuple values. (:issue:`6011`,
:issue:`6013`)

- Calls to ``datetime.utcnow()``, which is no longer recommended, have been
  replaced with calls to ``datetime.now()`` with an explicit timezone.
  (:issue:`6014`)
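
The retry-middleware pattern from the fix above, as a sketch — assuming the
tuple-style class attribute, with the extra exception type purely
illustrative::

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    class MyRetryMiddleware(RetryMiddleware):
        # Reading the parent class attribute from a subclass raised an
        # exception before this fix; extending it (here with an arbitrary
        # extra exception type) works again.
        EXCEPTIONS_TO_RETRY = RetryMiddleware.EXCEPTIONS_TO_RETRY + (ValueError,)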

Documentation
~~~~~~~~~~~~~

- Updated a deprecated function call in a pipeline example. (:issue:`6008`,
:issue:`6009`)

Quality assurance
~~~~~~~~~~~~~~~~~

- Extended typing hints. (:issue:`6003`, :issue:`6005`, :issue:`6031`,
:issue:`6034`)

- Pinned brotli_ to 1.0.9 for the PyPy tests as 1.1.0 breaks them.
(:issue:`6044`, :issue:`6045`)

- Other CI and pre-commit improvements. (:issue:`6002`, :issue:`6013`,
:issue:`6046`)

.. _release-2.10.1:

110 changes: 57 additions & 53 deletions docs/topics/extensions.rst
@@ -350,52 +350,8 @@ full list of parameters, including examples on how to instantiate
.. module:: scrapy.extensions.periodic_log
:synopsis: Periodic stats logging

Periodic log extension
~~~~~~~~~~~~~~~~~~~~~~
@@ -441,10 +397,10 @@ This extension periodically logs rich stat data as a JSON object::

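An illustrative sketch of such a log entry — the stat names shown are common
Scrapy stats; exact keys, values, and the logger prefix vary by crawl and
configuration::

    2023-09-18 12:00:00 [scrapy.extensions.periodic_log] INFO: {
        "delta": {
            "downloader/request_bytes": 55582,
            "downloader/response_count": 162
        },
        "stats": {
            "downloader/request_bytes": 229559,
            "downloader/response_count": 534,
            "response_received_count": 516
        },
        "time": {
            "elapsed": 120.5
        }
    }
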
This extension logs the following configurable sections:

- ``"delta"`` shows how some numeric stats have changed since the last stats
- ``"delta"`` shows how some numeric stats have changed since the last stats
log message.
The :setting:`PERIODIC_LOG_DELTA` setting determines the target stats. They

The :setting:`PERIODIC_LOG_DELTA` setting determines the target stats. They
must have ``int`` or ``float`` values.

- ``"stats"`` shows the current value of some stats.
@@ -453,11 +409,11 @@ This extension logs the following configurable sections:

- ``"time"`` shows detailed timing data.

  The :setting:`PERIODIC_LOG_TIMING_ENABLED` setting determines whether or
  not to show this section.

This extension logs data at the start, then on a fixed time interval
configurable through the :setting:`LOGSTATS_INTERVAL` setting, and finally
right before the crawl ends.


@@ -507,4 +463,52 @@ PERIODIC_LOG_TIMING_ENABLED

Default: ``False``

``True`` enables logging of timing data (i.e. the ``"time"`` section).
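
A sketch of enabling the extension with all three sections turned on, e.g.
in ``settings.py`` (the extension order value ``0`` is arbitrary)::

    EXTENSIONS = {
        "scrapy.extensions.periodic_log.PeriodicLog": 0,
    }
    PERIODIC_LOG_STATS = True
    PERIODIC_LOG_DELTA = True
    PERIODIC_LOG_TIMING_ENABLED = True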


.. module:: scrapy.extensions.debug
   :synopsis: Extensions for debugging Scrapy

Debugging extensions
--------------------

Stack trace dump extension
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. class:: StackTraceDump

Dumps information about the running process when a `SIGQUIT`_ or `SIGUSR2`_
signal is received. The information dumped is the following:

1. engine status (using ``scrapy.utils.engine.get_engine_status()``)
2. live references (see :ref:`topics-leaks-trackrefs`)
3. stack trace of all threads

After the stack trace and engine status is dumped, the Scrapy process continues
running normally.

This extension only works on POSIX-compliant platforms (i.e. not Windows),
because the `SIGQUIT`_ and `SIGUSR2`_ signals are not available on Windows.

There are at least two ways to send Scrapy the `SIGQUIT`_ signal:

1. By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
2. By running this command (assuming ``<pid>`` is the process id of the Scrapy
process)::

kill -QUIT <pid>

.. _SIGUSR2: https://en.wikipedia.org/wiki/SIGUSR1_and_SIGUSR2
.. _SIGQUIT: https://en.wikipedia.org/wiki/SIGQUIT

Debugger extension
~~~~~~~~~~~~~~~~~~

.. class:: Debugger

Invokes a :doc:`Python debugger <library/pdb>` inside a running Scrapy process when a `SIGUSR2`_
signal is received. After the debugger is exited, the Scrapy process continues
running normally.

For more info see `Debugging in Python`_.

This extension only works on POSIX-compliant platforms (i.e. not Windows).

.. _Debugging in Python: https://pythonconquerstheuniverse.wordpress.com/2009/09/10/debugging-in-python/
2 changes: 1 addition & 1 deletion docs/topics/settings.rst
@@ -98,7 +98,7 @@ and settings set there should use the "spider" priority explicitly:
    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings.set("SOME_SETTING", "some value", priority="spider")

.. versionadded:: 2.11

It's also possible to modify the settings in the
:meth:`~scrapy.Spider.from_crawler` method, e.g. based on :ref:`spider
arguments <spiderargs>`.
2 changes: 1 addition & 1 deletion docs/topics/spiders.rst
@@ -136,7 +136,7 @@ scrapy.Spider
attributes in the new instance so they can be accessed later inside the
spider's code.

.. versionchanged:: 2.11

   The settings in ``crawler.settings`` can now be modified in this
   method, which is handy if you want to modify them based on, e.g.,
   :ref:`spider arguments <spiderargs>`.
