Some suggestions · dixudx/tumblr-crawler#85

2018-06-05T23:53:13.000Z


(3 comments) (0 reactions) (0 assignees) Python (1,144 stars) (353 forks) batch import

help wanted

Description

1. The crawler should support the basic form of a blog address, such as "https://tumblr.blahblah.com/blah". When access to a particular Tumblr blog over http is blocked by the ISP, it should try https:// instead, or accept the address in its basic form.
2. There should be a way to suppress repeated download attempts once a download has failed. For example, save a dummy file under the same file name.
3. When the address has the form "https://www.tumblr.com/dashboard/blog/blah", the crawler skips the downloads.
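
To make points 1 and 3 concrete, here is a rough sketch of what address normalization could look like. The function name `normalize_blog_url` is hypothetical and not part of tumblr-crawler, and the mapping from a dashboard address to `<name>.tumblr.com` is an assumption about how such addresses should be resolved.

```python
from urllib.parse import urlparse


def normalize_blog_url(address: str) -> str:
    """Return an https base URL for a blog address (hypothetical helper).

    Only full URLs or "host/path" strings are handled here; bare site
    names are out of scope for this sketch.
    """
    address = address.strip()
    if "://" not in address:
        # No scheme given: assume https.
        address = "https://" + address

    parsed = urlparse(address)
    host = (parsed.hostname or "").lower()
    if not host:
        raise ValueError(f"cannot find a host in {address!r}")

    segments = [s for s in parsed.path.split("/") if s]

    # Dashboard form: https://www.tumblr.com/dashboard/blog/<name>
    # Assumption: <name> is served at https://<name>.tumblr.com.
    if (
        host in ("www.tumblr.com", "tumblr.com")
        and segments[:2] == ["dashboard", "blog"]
        and len(segments) >= 3
    ):
        return f"https://{segments[2]}.tumblr.com"

    # Any other address: force https, keep host and path as given.
    return f"https://{parsed.netloc.lower()}{parsed.path.rstrip('/')}"


print(normalize_blog_url("http://tumblr.blahblah.com/blah"))
print(normalize_blog_url("https://www.tumblr.com/dashboard/blog/blah"))
```

Forcing https covers the case where plain http is blocked, and keeping the path leaves custom-domain addresses untouched.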

Contributor guide

Tech stack: python
Domain: backend, cli
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: stale
Clarity: mostly clear
Prerequisites: Python, HTTP basics, URL parsing
Newbie friendliness: 40
Research direction: Review the existing URL parsing logic in the source code, likely in a Python module like `crawler.py`. Check the issue comments for additional context from the maintainer or other contributors. The three suggestions involve: (1) supporting HTTPS as the base URL format, (2) suppressing repeat downloads by saving a dummy file, and (3) correctly handling dashboard blog URLs. Implement a URL normalizer function and a download state tracker to address these. Look at the repository's test files to understand expected behavior.
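
As a starting point for the download state tracker, the sketch below records a failed download as an empty marker file so that later runs skip it. All names (`FAILED_SUFFIX`, `should_download`, `mark_failed`, `download_once`) are hypothetical and not taken from the repository, and `fetch` stands in for whatever call actually retrieves the file.

```python
import os

FAILED_SUFFIX = ".failed"  # hypothetical marker suffix, not from the repo


def marker_path(folder: str, file_name: str) -> str:
    """Path of the empty marker file that records a failed download."""
    return os.path.join(folder, file_name + FAILED_SUFFIX)


def should_download(folder: str, file_name: str) -> bool:
    """Skip files that already exist or that failed on an earlier run."""
    target = os.path.join(folder, file_name)
    return not os.path.exists(target) and not os.path.exists(
        marker_path(folder, file_name)
    )


def mark_failed(folder: str, file_name: str) -> None:
    """Create an empty marker so the file is not retried."""
    os.makedirs(folder, exist_ok=True)
    with open(marker_path(folder, file_name), "w"):
        pass


def download_once(folder: str, file_name: str, fetch) -> bool:
    """Call fetch() (returns bytes or raises) at most once per file.

    Returns True only when the file was written during this call.
    """
    if not should_download(folder, file_name):
        return False
    try:
        data = fetch()
    except Exception:
        mark_failed(folder, file_name)
        return False
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, file_name), "wb") as fh:
        fh.write(data)
    return True
```

The issue suggests saving the dummy file under the media file's own name. A separate suffix is used here instead so that failed downloads stay distinguishable from real files and can be retried by deleting the markers.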