basic command

create a project

scrapy startproject <project_name>

cd <project_name>

scrapy genspider <spider_name> <url>

for <url>, remove the www. prefix, or it will cause errors (the value ends up in the spider's allowed_domains).

list all spiders

scrapy list

start shell

scrapy shell

scrapy argument

scrapy crawl <spider_name> -a <argument_name>="<argument content>"

def close() and csv management

find the most recently generated csv: csv_file = max(glob.iglob("*.csv"), key=os.path.getctime)

convert the csv to xlsx using openpyxl
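
The two steps can be sketched as follows. The function names latest_csv and csv_to_xlsx are made up for this example, and the conversion assumes a UTF-8 csv whose cells can all be stored as plain strings:

```python
import csv
import glob
import os


def latest_csv(directory="."):
    """Return the path of the .csv file in directory with the newest ctime."""
    pattern = os.path.join(directory, "*.csv")
    # max() raises ValueError when no csv exists, so callers may want to guard it.
    return max(glob.iglob(pattern), key=os.path.getctime)


def csv_to_xlsx(csv_path, xlsx_path):
    """Copy every row of a csv file into the first sheet of a new workbook."""
    # Imported here so latest_csv() stays usable without openpyxl installed.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            ws.append(row)  # every cell is written as a string
    wb.save(xlsx_path)
```

From the spider's close hook this could be called as csv_to_xlsx(latest_csv(), "output.xlsx"), where output.xlsx is a placeholder name.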

scrapy architecture

  • scrapy.cfg: project configuration file
  • xxx.py: the spider itself
  • pipelines.py: item processing; the logic that keeps (returns) or drops each item goes here
  • settings.py: project settings

basic syntax

<selector> can be the response itself or any selector derived from it.

  • tag selector: <selector>.xpath('//h1/a')
  • extract just the text as a list of strings: <selector>.xpath('//h1/a/text()').extract()
  • class selector: <selector>.xpath('//*[@class="xxx"]')
  • contains: <selector>.xpath('//*[contains(@class, "xxxx")]/@class')
  • following-sibling: <selector>.xpath('//*[@id="xxx"]/following-sibling::p/...')

differences between /, //, *, @ and .

  • /: separates a tag from its direct child
  • //: separates a tag from its descendants at any depth
    • When we use // at the very beginning of the XPath expression, it searches the whole document from the root. So //a selects every <a> tag in the page, not just the first one.
    • When we use // in the middle of the XPath expression, it means “whatever tags in between”. So //p//a selects every <a> tag anywhere inside a <p> tag, even when another tag such as <span> sits between them. However, //p/a selects only the <a> tags that are direct children of a <p> tag.
  • *: matches any tag; combined with [], it matches any tag that satisfies the condition inside the brackets
  • @: selects an attribute, e.g. @class or @href
  • .: the current node. Start the expression with a dot (e.g. .//a) when you are extracting from a wrapper selector rather than from the response; without the dot, // searches the whole document again.

methods to get xpath

  1. google chrome copy->copy xpath
  2. firefox copy->copy xpath
  3. xpath helper (chrome extension)
  4. xpath tester

scrapy vs selenium

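scrapy is faster than selenium at scraping, because it sends HTTP requests directly instead of driving a real browser.

selenium, however, is primarily a browser automation tool built for testing.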

credits to: