basic command

create a project

scrapy startproject <project_name>

cd <project_name>

scrapy genspider <spider_name> <url>

for <url>, remove the leading www. or it will cause an error.

scrapy list

scrapy shell

scrapy crawl <spider_name> -a <argument_name>="<argument content>"

find the most recently generated csv: csv_file = max(glob.iglob("*.csv"), key=os.path.getctime)
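
The one-liner above can be wrapped in a small helper. This is a sketch, and the function name newest_csv and its directory parameter are made up for illustration. Note that os.path.getctime is the creation time on Windows but the last metadata change time on Unix.

```python
import glob
import os


def newest_csv(directory="."):
    """Return the path of the CSV with the latest ctime, or None if there is none."""
    pattern = os.path.join(directory, "*.csv")
    # default=None avoids a ValueError when no CSV file matches.
    return max(glob.iglob(pattern), key=os.path.getctime, default=None)
```

Usage: csv_file = newest_csv() looks in the current directory, which is where scrapy crawl writes its output when given a relative file name.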

convert the csv to xlsx using openpyxl
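
One possible way to do the conversion, as a sketch: read the rows with the standard csv module and append them to a new openpyxl workbook. The function name csv_to_xlsx and the UTF-8 encoding are assumptions, not something these notes prescribe.

```python
import csv

from openpyxl import Workbook


def csv_to_xlsx(csv_path, xlsx_path):
    """Copy every row of a CSV file into the first sheet of a new workbook."""
    workbook = Workbook()
    sheet = workbook.active
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Cells are written as strings; no type conversion is attempted.
            sheet.append(row)
    workbook.save(xlsx_path)
```

Usage: csv_to_xlsx("items.csv", "items.xlsx"), where both file names are placeholders.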

inside a spider callback or scrapy shell, <selector> is usually the response object itself

tag selector: <selector>.xpath('//h1/a/text()')
extract just the text as a list of strings: <selector>.xpath('//h1/a/text()').extract()
class selector (exact match on the whole class attribute): <selector>.xpath('//*[@class="xxx"]')
contains: <selector>.xpath('//*[contains(@class, "xxxx")]/@class')
following-sibling: <selector>.xpath('//*[@id="xxx"]/following-sibling::p/...')

/: separates a tag from its direct child
//: separates a tag from any descendant, however deeply nested
- When // appears at the very beginning of the XPath expression, it searches the whole document, starting from <html>. So //a matches every <a> tag on the page, not only the first one.
- When // appears in the middle of the XPath expression, it means "whatever tags are in between". So //p//a matches every <a> tag anywhere inside a <p> tag, even if another parent tag such as <span> wraps the <a>. However, //p/a matches only an <a> that is directly inside a <p> tag.
*: matches any tag that satisfies the condition in []
@: refers to an attribute of the tag, e.g. @class or @id
.: refers to the current node. Start the expression with a dot (e.g. .//a) when extracting from a wrapper selector rather than from the response; without it, // searches the whole document again.

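scrapy is generally faster than selenium for scraping, because it sends HTTP requests directly instead of driving a real browser.

selenium, however, is built for browser automation and testing.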