How do I download a file over HTTP using Python?

Question

I have a small utility that I use to download an MP3 from a website on a schedule and then build/update a podcast XML file, which I have added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3. I would prefer to have the entire utility written in Python.

I struggled to find a way to actually download the file in Python, which is why I resorted to wget.

How do I download the file using Python?

707

python http urllib

Author: kilojoules, 2008-08-22

Source

21 answers

One more, using urlretrieve:

import urllib
urllib.urlretrieve ("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(for Python 3+, use import urllib.request and urllib.request.urlretrieve)

One more, with a "progress bar":

import urllib2

url = "http://download.thinkbroadband.com/10MB.zip"

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()

963

Author: PabloG,
2016-08-19 12:34:18

In 2012, use the Python requests library

>>> import requests
>>> 
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because its API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.

2015-12-30

People have expressed admiration for the progress bar. It's cool, sure. There are now several off-the-shelf solutions, including tqdm:

from tqdm import tqdm
import requests

url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)

with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)

This is essentially the implementation @kvance described 30 months ago.

302

Author: hughdbrown,
2015-12-31 16:45:47

import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
  output.write(mp3file.read())

The wb in open('test.mp3','wb') opens the file (erasing any existing file) in binary mode, so you can write raw data to it instead of just text.
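To illustrate why the mode matters, here is a small Python 3 sketch. The helper name try_write and the sample bytes are made up for illustration and are not part of the answer. Downloaded MP3 data arrives as bytes, and a file opened in text mode refuses to write them:

```python
import os
import tempfile


def try_write(path, mode, data):
    """Write data to path in the given mode; return the error name, or None on success."""
    try:
        with open(path, mode) as output:
            output.write(data)
    except TypeError as exc:
        # A text-mode file only accepts str, so bytes are rejected.
        return type(exc).__name__
    return None


# Made-up sample standing in for the start of a downloaded MP3.
sample = b"ID3\x03\x00\x00\x00"
target = os.path.join(tempfile.mkdtemp(), "test.mp3")

text_result = try_write(target, "w", sample)     # text mode: fails
binary_result = try_write(target, "wb", sample)  # binary mode: works
```

Under Python 2 both calls would succeed, because str and bytes are the same type there, but on Windows text mode would still translate newline bytes and corrupt binary data.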

148

Author: Grant,
2016-03-10 17:14:43

Python 3

urllib.request.urlopen

import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()

urllib.request.urlretrieve

import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')

Python 2

urllib2.urlopen (thanks Corey)

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()

urllib.urlretrieve (thanks PabloG)

import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')

69

Author: bmaupin,
2018-01-03 14:00:12

An improved version of PabloG's code for Python 2/3:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )

import sys, os, tempfile, logging

if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse

def download_file(url, dest=None):
    """ 
    Download and save a file specified by url to dest directory,
    """
    u = urllib2.urlopen(url)

    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)

    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += len(buffer)
            f.write(buffer)

            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()

    return filename

if __name__ == "__main__":  # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)

18

Author: Stan,
2017-08-06 06:32:19

I wrote the wget library in pure Python just for this purpose. It is a pumped-up urlretrieve with these features as of version 2.0.

16

Author: anatoly techtonik,
2013-09-25 17:55:16

Use the wget module:

import wget
wget.download('url')

16

Author: Sara Santana,
2015-03-25 12:59:25

I agree with Corey: urllib2 is more complete than urllib and should probably be the module you use if you want to do more complex things. But to make the answers more complete, urllib is a simpler module if you want just the basics:

import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()

This will work fine. Or, if you don't want to deal with the "response" object, you can call read() directly:

import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()

12

Author: akdom,
2008-08-22 15:58:52

A simple way that is compatible with both Python 2 and Python 3 comes with the six library:

from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

12

Author: Akif,
2018-01-16 12:05:10

The following are the most commonly used calls for downloading files in Python:

urllib.urlretrieve ('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)

Note: urlopen and urlretrieve perform relatively badly when downloading large files (size > 500 MB). requests.get keeps the whole file in memory until the download is complete.
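The memory issue with requests.get can be avoided by passing stream=True and writing the body in chunks. The following is a minimal sketch, assuming the requests library is installed; the helper name save_streamed and the default chunk size are illustrative choices, not part of the answer. The helper takes the response as an argument and relies only on its iter_content method:

```python
def save_streamed(response, path, chunk_size=8192):
    """Write a streamed response to path chunk by chunk; return the bytes written.

    `response` is expected to be what requests.get(url, stream=True) returns.
    With stream=True the body is fetched lazily while iter_content() is consumed,
    so only one chunk is held in memory at a time.
    """
    written = 0
    with open(path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                handle.write(chunk)
                written += len(chunk)
    return written
```

Typical use would be save_streamed(requests.get(url, stream=True), file_name) after import requests.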

11

Author: Jaydev,
2016-09-19 12:45:10

You can also get progress feedback with urlretrieve:

import sys
import urllib

def report(blocknr, blocksize, size):
    # Called by urlretrieve after each block; size is the total size in bytes
    current = blocknr*blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))

def downloadFile(url):
    print "\n",url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
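The code above is Python 2. A possible Python 3 variant is sketched below; the names percent_done and download_file are illustrative. It also covers two cases the original hook ignores: urlretrieve passes -1 as the size when no Content-Length is available, and the last block is usually partial, so the raw percentage can exceed 100.

```python
import sys
import urllib.request


def percent_done(blocknr, blocksize, size):
    """Return download progress in percent, or None if the size is unknown."""
    if size <= 0:
        # urlretrieve passes -1 when the server sends no Content-Length.
        return None
    # The last block is usually partial, so cap the value at 100.
    return min(100.0, 100.0 * blocknr * blocksize / size)


def report(blocknr, blocksize, size):
    """Reporthook for urlretrieve: print the progress on a single line."""
    percent = percent_done(blocknr, blocksize, size)
    if percent is not None:
        sys.stdout.write("\r{0:.2f}%".format(percent))
        sys.stdout.flush()


def download_file(url, fname=None):
    """Download url to fname (default: last path segment); return the filename."""
    if fname is None:
        fname = url.split("/")[-1]
    urllib.request.urlretrieve(url, fname, report)
    return fname
```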

6

Author: Marcin Cuprjak,
2014-01-26 13:12:54

If you have wget installed, you can use parallel_sync.

pip install parallel_sync

from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)

Doc: https://pythonhosted.org/parallel_sync/pages/examples.html

This is pretty powerful. It can download files in parallel, retry upon failure, and even download files on a remote machine.

5

Author: max,
2015-11-19 23:48:06

If speed matters to you, I ran a small performance test for the urllib and wget modules. For wget I tried once with the status bar and once without. I used three different 500 MB files for the test (different files, to eliminate the chance of some caching going on under the hood). Tested on a Debian machine with python2.

First, these are the results (they are similar across runs):

$ python wget_test.py 
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============

I performed the test using a "profile" decorator. This is the full code:

import wget
import urllib
import time
from functools import wraps

def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner

url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'

def do_nothing(*args):
    pass

@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)

@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)

@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')

urlretrive_test(url1)
print '=============='
time.sleep(1)

wget_no_bar_test(url2)
print '=============='
time.sleep(1)

wget_with_bar_test(url3)
print '=============='
time.sleep(1)

urllib seems to be the fastest.

3

Author: Omer Dagan,
2017-11-03 14:25:38

In Python 3 you can use the urllib.request and shutil modules. Both are part of the standard library, so there is nothing to install with pip or pip3.

Run this code:

import urllib.request
import shutil

url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Note that the code imports the standard-library urllib package, not the third-party urllib3.

3

Author: Apoorv Agarwal,
2018-02-08 17:37:15

The source code can be:

import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()                            
sock.close()                                        
print htmlSource

2

Author: Sherlock Smith,
2013-11-26 14:25:14

I wrote the following, which works in vanilla Python 2 or Python 3.

import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False


def progress_callback_simple(downloaded,total):
    sys.stdout.write(
        "\r" +
        (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
        " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
    )
    sys.stdout.flush()

def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback!=None: progress_callback(0,file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)

            if progress_callback!=None: progress_callback(file_size,file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break

                file_size_dl += len(buffer)
                out_file.write(buffer)

                if progress_callback!=None: progress_callback(file_size_dl,file_size)
    with open(dstfilepath,"wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response,out_file,file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response,out_file,file_size)

import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()

Notes:

Supports a "progress bar" callback.
The download is a 4 MB test .zip from my website.

1

Author: imallett,
2017-05-13 21:52:30

urlretrieve and requests.get are simple, but reality is not. I have fetched data from a couple of sites, including text and images, and the two calls above probably solve most tasks. For a more universal solution, though, I suggest using urlopen. Because it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing any site-packages.

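import urllib.request

# Placeholder values for illustration; the original answer leaves these undefined.
url = 'http://www.example.com/songs/mp3.mp3'
filename = 'mp3.mp3'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; any header dict works
buffer_size = 8192

url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)
# body length taken from Content-Length, or None if the server did not send it
len_content = url_connect.length

#remember to open file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break

        #an integer value of size of written data
        data_wrote = f.write(buffer)

#you could probably use with-open-as manner
url_connect.close()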

This answer provides a solution to HTTP 403 Forbidden when downloading a file over HTTP using Python. I have tried only the requests and urllib modules. Other modules may provide something better, but this is the one I used to solve most of the problems.
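One common cause of HTTP 403 is that some servers reject the default User-Agent that urllib sends (Python-urllib/x.y), which is presumably what the headers argument above is meant to override. A minimal sketch of building such a request follows. The helper name build_request and the User-Agent string are placeholders chosen for illustration, and whether a given server accepts them is not guaranteed.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-downloader)"):
    """Return a Request that sends a custom User-Agent instead of urllib's default."""
    headers = {"User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


request = build_request("http://www.example.com/songs/mp3.mp3")
# urllib stores header names capitalized, hence "User-agent" here.
sent_agent = request.get_header("User-agent")
```

The resulting object can be passed to urllib.request.urlopen in place of a plain URL string.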

0

Author: Sphynx-HenryAY,
2017-03-13 13:12:19

This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look awesome! Check it out:

    import urllib2,os

    url = "http://download.thinkbroadband.com/10MB.zip"

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    os.system('cls')
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

If you are running in an environment other than Windows, you will have to use something other than 'cls'. On Mac OS X and Linux it should be 'clear'.
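One way to avoid hard-coding the command is to pick it from os.name at run time. This is only a sketch; the helper names clear_command and clear_screen are made up for illustration.

```python
import os


def clear_command(os_name=os.name):
    """Return the shell command that clears the terminal on the given platform."""
    # os.name is 'nt' on Windows and 'posix' on Linux and macOS.
    return "cls" if os_name == "nt" else "clear"


def clear_screen():
    """Clear the terminal using the command that matches the current platform."""
    os.system(clear_command())
```

In the script above, os.system('cls') would then become clear_screen().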

0

Author: JD3,
2017-05-16 16:46:21

For completeness, it is also possible to call any file-retrieval program using the subprocess package. Dedicated download programs are more powerful than Python functions such as urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects and HTTP proxies, and can avoid re-downloading existing files (-nc), while aria2 can download files in parallel.

import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])

In a Jupyter Notebook, you can also call programs directly with the ! syntax:

!wget -O example_output_file.html https://example.com
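When calling wget from a script, it can help to check that the program exists and to fail loudly if the download does not succeed. A minimal sketch, assuming wget is the external program; the function names wget_command and download_with_wget are illustrative:

```python
import shutil
import subprocess


def wget_command(url, output_path):
    """Build the argument list for a wget call that saves url to output_path."""
    return ["wget", "-O", output_path, url]


def download_with_wget(url, output_path):
    """Run wget and raise if it is missing or exits with a non-zero status."""
    if shutil.which("wget") is None:
        raise FileNotFoundError("wget was not found on PATH")
    # check=True turns a failed download into subprocess.CalledProcessError.
    subprocess.run(wget_command(url, output_path), check=True)
    return output_path
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with URLs that contain characters such as & or ?.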

0

Author: Robin Dinse,
2018-08-29 12:24:49

You can use PycURL on Python 2 and 3.

import pycurl

FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'

with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()

0

Author: gzerone,
2018-09-10 06:01:38

score 391 · Accepted Answer

In Python 2, use urllib2, which comes with the standard library.

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()

This is the most basic way to use the library, without any error handling. You can also do more complex things, such as changing headers. The documentation can be found here.
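Since the snippet above leaves out error handling, here is one possible way to add it. This sketch uses Python 3's urllib.request and urllib.error, the successors of urllib2, rather than urllib2 itself. The function name fetch, the default timeout, and the (body, error) return convention are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (body, error): the response bytes, or None plus an error message."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 403.
        return None, "HTTP error {0}: {1}".format(exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable answer at all, e.g. DNS failure or refused connection.
        return None, "URL error: {0}".format(exc.reason)
```

HTTPError is a subclass of URLError, so it has to be caught first for the more specific message to be used.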