ArchiveBox/ArchiveBox

`COOKIES_FILE` isn't used when fetching page titles, leading to saving captcha-page titles like "Before you continue to YouTube..."

Open

#761 opened on June 5, 2021

5 comments, 1 reaction, 0 assignees
Repository: Python, 19,591 stars, 1,069 forks
Labels: batch import, good first ticket, help wanted, size: easy, status: backlog, why: functionality

Description

Describe the bug

The title becomes 'Before you continue to YouTube' instead of the video title because YouTube redirects to a cookie consent form. This could be solved by passing a cookie file to the curl command that is run:

["curl", "--silent", "--location", "--compressed", "--max-time", "60", "--user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 7.76.0 (amd64-portbld-freebsd12.2)", "https://www.youtube.com/watch?v=aP8sRCun63M"]
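A minimal sketch of the proposed fix, assuming the title fetcher builds its curl argument list in Python. The helper name `build_curl_args` and its parameters are hypothetical and are not taken from the ArchiveBox source; the only curl feature relied on is the `--cookie` option, which reads cookies from a file when its argument contains no `=` character.

```python
from pathlib import Path
from typing import List, Optional


def build_curl_args(url: str,
                    cookies_file: Optional[str] = None,
                    user_agent: str = "ArchiveBox",
                    timeout: int = 60) -> List[str]:
    """Assemble a curl command like the one above (hypothetical helper)."""
    args = [
        "curl", "--silent", "--location", "--compressed",
        "--max-time", str(timeout),
        "--user-agent", user_agent,
    ]
    # Assumption: cookies_file holds the value of the COOKIES_FILE setting.
    # Only pass it to curl when it is set and the file exists on disk.
    if cookies_file and Path(cookies_file).is_file():
        args += ["--cookie", cookies_file]
    args.append(url)
    return args
```

With a valid cookies file configured, the resulting list would contain `--cookie <path>` just before the URL, so the consent redirect could be skipped if the cookies carry the consent state. With no file configured, the command is unchanged from the one shown above.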

Steps to reproduce

  1. archivebox add https://www.youtube.com/watch?v=aP8sRCun63M
  2. Title becomes 'Before you continue to YouTube' when it should be 'ArchiveBox'

Screenshots or log output

N/A

ArchiveBox version

ArchiveBox v0.6.2
Cpython FreeBSD FreeBSD-12.2-RELEASE-p6-amd64-64bit-ELF amd64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     ./.local/bin/archivebox                                                     
 √  PYTHON_BINARY         v3.7.10         valid     /usr/local/bin/python3.7                                                    
 √  DJANGO_BINARY         v3.1.12         valid     ./.local/lib/python3.7/site-packages/django/bin/django-admin.py             
 √  CURL_BINARY           v7.76.0         valid     /usr/local/bin/curl                                                         
 √  WGET_BINARY           v1.21           valid     /usr/local/bin/wget                                                         
 √  NODE_BINARY           v14.16.1        valid     /usr/local/bin/node                                                         
 √  SINGLEFILE_BINARY     v0.3.13         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.1.0          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.31.1         valid     /usr/local/bin/git                                                          
 √  YOUTUBEDL_BINARY      v2021.05.16     valid     /home/archivebox/.local/bin/youtube-dl                                      
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/local/bin/chrome                                                       
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/local/bin/rg                                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     ./.local/lib/python3.7/site-packages/archivebox                             
 √  TEMPLATES_DIR         3 files         valid     ./.local/lib/python3.7/site-packages/archivebox/templates                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  1 files         valid     ./~/.config/chromium                                                        
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
