psf/requests-html

When requesting a page that is ISO-8859-1 encoded, HTML is still interpreted as UTF-8

Open

#442 opened on Jan 27, 2021

help wanted

Description

When requesting a page that is ISO-8859-1 encoded:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://gerda.geus.dk/Gerda/Search')
>>> r.encoding
'ISO-8859-1'
>>> r.html.default_encoding
'ISO-8859-1'
>>> r.html.encoding
'utf8'
>>> r.html.find("option")[-1].text
'Bygge-anl�g'

Expected behavior:

>>> r.html.find("option")[-1].text
'Bygge-anlæg'

As far as I can see, there are two problems:

  • r.html.encoding is incorrectly set to 'utf8', even though r.encoding and r.html.default_encoding are both 'ISO-8859-1'
  • r.html.element (the PyQuery instance) does not take the encoding into account at all and simply assumes UTF-8
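
For reference, the symptom can be reproduced without the network or requests-html at all. The sketch below uses only the standard library. It encodes the expected string as ISO-8859-1 and decodes it once as UTF-8 (what seems to be happening) and once with the declared charset. The helper `decode_body` and its `fallback` parameter are hypothetical names made up for illustration, not part of requests-html.

```python
import codecs
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str], fallback: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset (hypothetical helper).

    Falls back to `fallback` when no charset is declared or Python
    does not know the declared name.
    """
    try:
        codec = codecs.lookup(declared).name if declared else fallback
    except LookupError:
        codec = fallback
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return raw.decode(codec, errors="replace")


# Bytes as an ISO-8859-1 server would send them: 'æ' is the single byte 0xE6.
raw = "Bygge-anlæg".encode("iso-8859-1")

# Decoding as UTF-8: 0xE6 starts a multi-byte sequence that never completes,
# so it is replaced by U+FFFD, matching the output reported above.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the charset the response declares gives the expected text.
as_latin1 = decode_body(raw, "ISO-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

A possible workaround along these lines would be to decode `r.content` with `r.encoding` yourself and build a fresh `requests_html.HTML(html=...)` from the resulting string. That is an assumption that has not been verified against the affected version.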
