psf/requests-html

When requesting a page that is ISO-8859-1 encoded, HTML is still interpreted as UTF-8

Open

#442 opened on Jan 27, 2021

help wanted

Description

When requesting a page that is ISO-8859-1 encoded:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://gerda.geus.dk/Gerda/Search')
>>> r.encoding
'ISO-8859-1'
>>> r.html.default_encoding
'ISO-8859-1'
>>> r.html.encoding
'utf8'
>>> r.html.find("option")[-1].text
'Bygge-anl�g'

Expected behavior:

>>> r.html.find("option")[-1].text
'Bygge-anlæg'

As far as I can see, there are two problems:

  • r.html.encoding is incorrectly set to 'utf8', even though r.encoding and r.html.default_encoding are both 'ISO-8859-1'
  • r.html.element (the PyQuery instance) does not take the encoding into account at all and simply assumes UTF-8
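
The symptom can be reproduced without any network access. The following is a minimal sketch using only the standard library, not the requests-html code path itself: it assumes the page bytes are ISO-8859-1 and that the parser decodes them as UTF-8 with replacement of invalid bytes. The names `raw`, `wrong` and `right` are made up for illustration.

```python
# Bytes as an ISO-8859-1 server would send them ("æ" is the single byte 0xE6).
raw = "Bygge-anlæg".encode("iso-8859-1")

# Decoding as UTF-8: 0xE6 starts a multi-byte sequence, but the next byte
# ("g") is not a continuation byte, so it becomes U+FFFD (shown as "�").
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the response actually declares.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(repr(right))
```

If this assumption holds, the fix would be for the HTML object (and the PyQuery instance built from it) to decode with the response's declared encoding rather than with UTF-8.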
