How to get the Scrapy failure URLs?

Question

I'm new to Scrapy, and it is the most amazing crawler framework I know!

In my project I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I only see some statistics but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
 'downloader/request_bytes': 46282582,
 'downloader/request_count': 92383,
 'downloader/request_method_count/GET': 92383,
 'downloader/response_bytes': 123766459,
 'downloader/response_count': 92382,
 'downloader/response_status_count/200': 92382,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
 'item_scraped_count': 46191,
 'request_depth_max': 1,
 'scheduler/memory_enqueued': 92383,
 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, one that shows the failed URLs. Thanks!

35

python web-scraping scrapy report

Author: alecxe, 2012-12-05

Source

8 answers

Here's another example of how to handle and collect 404 errors (checking the GitHub help pages):

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run scrapy runspider with -o output.json and see the list of items in the output.json file.
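Once the run finishes, the exported file can be inspected with a few lines of Python. This is a minimal sketch, assuming the feed was written as a single JSON array of objects carrying the url, referer and status fields from the item above; the function name load_broken_links is made up for illustration.

```python
import json


def load_broken_links(path):
    """Return (url, referer) pairs for every 404 item in a JSON feed file."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)  # assumption: the feed is one JSON array of objects
    return [
        (item["url"], item.get("referer"))
        for item in items
        if item.get("status") == 404
    ]
```

Calling load_broken_links("output.json") after the crawl would then give the broken links together with the pages that referred to them.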

15

Author: alecxe, 2016-12-19 06:53:41

The answers from @Talvalin and @alecxe helped me a great deal, but they do not seem to capture downloader events that do not generate a response object (for instance, twisted.internet.error.TimeoutError and twisted.web.http.PotentialDataLoss). These errors show up in the stats dump at the end of the run, but without any meta information.

As I found out here, the errors are tracked by the stats.py middleware. They are captured in the process_exception method of the DownloaderStats class, specifically in the ex_class variable, which increments a counter for each error type and then dumps the counts at the end of the run.

To match such errors with information from the corresponding request object, you can add a unique identifier to each request (via request.meta), then pull it into the process_exception method of stats.py:

self.stats.set_value('downloader/my_errs/{0}'.format(request.meta), ex_class)

This generates a unique string for each downloader-based error that is not accompanied by a response. You can then save the altered stats.py under another name (e.g. my_stats.py), add it to the downloader middlewares (with the right precedence), and disable the stock stats.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.my_stats.MyDownloaderStats': 850,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': None,
}
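As an alternative to copying stats.py wholesale, the same idea can be sketched as a small standalone middleware. This is only an illustration under stated assumptions: the class name RequestErrorStats and the meta key error_id are hypothetical, and the class could be registered under its own key in DOWNLOADER_MIDDLEWARES (for instance at 851) while leaving the stock DownloaderStats enabled.

```python
class RequestErrorStats:
    """Downloader middleware sketch: records which request raised which exception."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through from_crawler and passes the crawler in
        return cls(crawler.stats)

    def process_exception(self, request, exception, spider):
        ex_class = "{0}.{1}".format(exception.__class__.__module__,
                                     exception.__class__.__name__)
        # 'error_id' is a hypothetical meta key set by the spider; fall back to the URL
        error_id = request.meta.get("error_id", request.url)
        self.stats.set_value("downloader/my_errs/{0}".format(error_id), ex_class)
        return None  # returning None lets other middlewares keep handling the exception
```

A spider would then tag each request itself, for example with meta={'error_id': '0/14'}, so that the stat key identifies the failed request.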

The output at the end of the run looks like this (here using meta information in which each request URL is mapped to a group_id and a member_id separated by a slash, like '0/14'):

{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web.http.PotentialDataLoss': 3,
 'downloader/my_errs/0/1': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/38': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/86': 'twisted.web.http.PotentialDataLoss',
 'downloader/request_bytes': 47583,
 'downloader/request_count': 133,
 'downloader/request_method_count/GET': 133,
 'downloader/response_bytes': 3416996,
 'downloader/response_count': 130,
 'downloader/response_status_count/200': 95,
 'downloader/response_status_count/301': 24,
 'downloader/response_status_count/302': 8,
 'downloader/response_status_count/500': 3,
 'finish_reason': 'finished'....}

This answer deals with non-downloader-based errors.

10

Author: bahmait, 2018-04-09 16:12:16

By default, Scrapy ignores 404 responses and does not pass them to the parse callback. Handling them is simple: allow the status codes you care about in the settings:

HTTPERROR_ALLOWED_CODES = [404,403]

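Then handle the response status code in your parse function:

def parse(self, response):
    if response.status == 404:
        # your action on error, e.g. log the failing URL
        self.logger.warning('404 for %s', response.url)

In short: allow the codes in the settings, then check the response status in the parse function.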

8

Author: harivans kumar, 2015-12-08 06:55:24

As of Scrapy 0.24.6, the method suggested by alecxe does not catch errors for the start URLs. To record errors for the start URLs, you need to override parse_start_url. Adapting alecxe's answer for this purpose, you would get:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field

class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()

class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.handle_response(response)

    def parse_item(self, response):
        return self.handle_response(response)

    def handle_response(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

5

Author: Louis, 2017-05-23 12:34:34

This is an update on this question. I ran into a similar problem and needed to use Scrapy signals to call a function in my pipeline. I have edited @Talvalin's code, but wanted to post a separate answer for more clarity.

Basically, you should add self as an argument of handle_spider_closed. You should also call the dispatcher in __init__ so that you can pass the spider instance (self) to the handling method.

from scrapy.spider import Spider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(Spider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []
        # the dispatcher is now called in init
        dispatcher.connect(self.handle_spider_closed,signals.spider_closed) 


    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, spider, reason): # added self 
        self.crawler.stats.set_value('failed_urls',','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__,  exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

I hope this helps anyone with the same problem in the future.
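Newer Scrapy releases expose the same signals through crawler.signals, so the collection step does not have to live in the spider. Below is a minimal sketch of that idea as a standalone extension. The class name FailedUrlStats and the module path used to enable it are hypothetical, and only 404 responses are recorded, as in the spider above.

```python
class FailedUrlStats:
    """Extension sketch: collects the URLs of 404 responses and stores them in the stats."""

    def __init__(self, stats):
        self.stats = stats
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler):
        # imported here so the class definition itself loads without Scrapy installed
        from scrapy import signals

        ext = cls(crawler.stats)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def response_received(self, response, request, spider):
        if response.status == 404:
            self.stats.inc_value("failed_url_count")
            self.failed_urls.append(response.url)

    def spider_closed(self, spider, reason):
        # join the collected URLs into one stat value, as the spider version does
        self.stats.set_value("failed_urls", ",".join(self.failed_urls))
```

It would be enabled with an entry such as EXTENSIONS = {'myproject.extensions.FailedUrlStats': 500} in settings.py, where the path is a placeholder.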

5

Author: Mattias, 2015-07-06 09:44:48

In addition to some of these answers, if you want to track Twisted errors, I would take a look at the errback parameter of the Request object. It lets you set a callback function that is called with the Twisted Failure when a request fails. Besides the URL, this method also lets you track the type of failure.

You can then log the URLs using failure.request.url (where failure is the Twisted Failure object passed to the errback).

import logging

import scrapy

# these would be in a Spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse,
                             errback=self.handle_error)

def handle_error(self, failure):
    url = failure.request.url
    logging.error('Failure type: %s, URL: %s', failure.type, url)

The Scrapy docs give a full example of how this can be done, except that the calls to the Scrapy logger are now deprecated, so I have adapted my example to use Python's built-in logging:

https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
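To make the logged line a little more informative, the failure can be summarised before it is logged. The sketch below is an illustration only: the helper name describe_failure is made up, and it relies on the request attribute that Scrapy attaches to the failure, as used above.

```python
import logging

logger = logging.getLogger(__name__)


def describe_failure(failure):
    """Build a one-line summary from the Failure passed to an errback."""
    # failure.type is the exception class; failure.request is attached by Scrapy
    return "{0}: {1} ({2})".format(failure.type.__name__,
                                   failure.request.url,
                                   failure.getErrorMessage())


def handle_error(failure):
    # in a spider this would be a method taking (self, failure)
    logger.error(describe_failure(failure))
```

Keeping the formatting in a separate function makes it easy to check without running a crawl.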

1

Author: Michael Yang, 2017-08-31 23:37:24

You can capture failed URLs in two ways.

Define a Scrapy request with an errback

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"  # placeholder name

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def errback(self, failure):
        '''handle failed url (failure.request.url)'''
        pass

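Use signals.request_dropped

import scrapy
from scrapy import signals

class TestSpider(scrapy.Spider):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # the crawler is available here, not in __init__
        crawler.signals.connect(spider.request_dropped, signal=signals.request_dropped)
        return spider

    def request_dropped(self, request, spider):
        '''handle failed url (request.url)'''
        pass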

[!Notice] A Scrapy request with an errback cannot catch some auto-retry failures, such as connection errors or responses whose status is listed in RETRY_HTTP_CODES in the settings.

0

Author: jdxin0, 2018-05-28 15:57:38

score 44 · Accepted Answer

Yes, this is possible.

I added a failed_urls list to my spider class and appended URLs to it if the response status was 404 (this will need to be extended to cover other error statuses).

Then I added a handler that joins the list into a single string and adds it to the stats when the spider is closed.

Based on your comments, it is possible to track Twisted errors.

from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(BaseSpider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(spider, reason):
        self.crawler.stats.set_value('failed_urls', ','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

    dispatcher.connect(handle_spider_closed, signals.spider_closed)

Output (the downloader/exception_count* stats only appear if exceptions are actually thrown; I simulated them by trying to run the spider after turning off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
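Because the URLs are stored as one joined string, they have to be split again before they can be processed. A minimal sketch, assuming the stats are available as a plain dict (for example from crawler.stats.get_stats()); the helper name failed_urls_from_stats is made up for illustration.

```python
def failed_urls_from_stats(stats):
    """Split the comma-joined 'failed_urls' stat back into a list of URLs."""
    raw = stats.get("failed_urls", "")
    # strip() tolerates the ', ' separator seen in the dump above
    return [url.strip() for url in raw.split(",") if url.strip()]
```

Note that a URL which itself contains a comma would be split in two by this approach; storing the list under a different separator, or as a list value, would avoid that.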