psf/requests-html

When requesting a page that is ISO-8859-1 encoded, HTML is still interpreted as UTF-8

Open

#442 opened on 27 Jan 2021

help wanted

Description

When requesting a page that is ISO-8859-1 encoded:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://gerda.geus.dk/Gerda/Search')
>>> r.encoding
'ISO-8859-1'
>>> r.html.default_encoding
'ISO-8859-1'
>>> r.html.encoding
'utf8'
>>> r.html.find("option")[-1].text
'Bygge-anl�g'

Expected behavior:

>>> r.html.find("option")[-1].text
'Bygge-anlæg'

As far as I can see, there are two problems:

  • r.html.encoding is incorrectly set to 'utf8', even though r.encoding and r.html.default_encoding are both 'ISO-8859-1'
  • r.html.element (the PyQuery instance) does not take the encoding into account at all; it just assumes UTF-8
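
The symptom can be reproduced without any network access or requests-html at all. This is a minimal sketch using only the standard library. The byte string stands in for the response body, and `decode_body` is a hypothetical helper for illustration, not part of requests-html. It assumes the parser replaces invalid bytes rather than raising an error, which is what the `�` in the output above suggests.

```python
# Stand-in for the raw response body: the expected text encoded as ISO-8859-1.
raw = "Bygge-anlæg".encode("iso-8859-1")  # 'æ' becomes the single byte 0xE6


def decode_body(raw_bytes: bytes, encoding: str) -> str:
    """Hypothetical helper: decode bytes, replacing invalid sequences with U+FFFD."""
    return raw_bytes.decode(encoding, errors="replace")


# Decoding with the wrong codec: 0xE6 starts a multi-byte UTF-8 sequence,
# but the next byte ('g') is not a continuation byte, so 0xE6 is replaced.
wrong = decode_body(raw, "utf-8")

# Decoding with the encoding the server actually declared.
right = decode_body(raw, "iso-8859-1")

print(repr(wrong))
print(repr(right))
```

If the two problems listed above are the cause, a fix would presumably need to pass the response encoding through to both the encoding detection and the PyQuery parsing step, so that the body is decoded as in the second call rather than the first.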
