basic command
create a project
scrapy startproject <project_name>
cd <project_name>
scrapy genspider <spider_name> <url>
for <url>, remove the leading www. or it will cause an error.
list all spiders
scrapy list
start shell
scrapy shell
scrapy argument
scrapy crawl <spider_name> -a <argument_name>="<argument content>"
def close() and csv management
find the most recently generated csv: csv_file = max(glob.iglob("*.csv"), key=os.path.getctime)
convert the csv to xlsx using openpyxl
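The two steps above can be sketched as a pair of helpers. The function names `latest_csv` and `csv_to_xlsx` are made up for this example, and openpyxl is a third-party package that has to be installed separately.

```python
import csv
import glob
import os


def latest_csv(directory="."):
    """Return the CSV in `directory` with the newest ctime, or None if there is none."""
    paths = glob.iglob(os.path.join(directory, "*.csv"))
    return max(paths, key=os.path.getctime, default=None)


def csv_to_xlsx(csv_path, xlsx_path):
    """Copy every row of `csv_path` into a new workbook saved at `xlsx_path`."""
    from openpyxl import Workbook  # third-party: pip install openpyxl

    workbook = Workbook()
    sheet = workbook.active
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            sheet.append(row)  # every cell is written as text
    workbook.save(xlsx_path)
```

One place to call these is the spider's close hook, for example csv_to_xlsx(latest_csv(), "output.xlsx"). This assumes the csv export has already been fully written by the time the hook runs, which is worth checking in your own project.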
scrapy architecture
- scrapy.cfg
- xxx.py: spider itself
- pipelines.py: continue or drop logic here
- settings.py
basic syntax
<selector> stands for any selector object; often it is simply response
- tag selector: <selector>.xpath('//h1/a/text()')
- select just text from xpath: <selector>.xpath('//h1/a/text()').extract()
- class selector: <selector>.xpath('//*[@class="xxx"]')
- contains: <selector>.xpath('//*[contains(@class, "xxxx")]/@class')
- following-sibling: <selector>.xpath('//*[@id="xxx"]/following-sibling::p/...')
differences between /, //, *, @ and .
- /: separates a tag from its direct child
- //: separates a tag from a descendant at any depth
  - When we use // at the very beginning of the XPath expression, the search covers the whole document, from the <html> tag down. So //a matches every <a> tag in the document, not only the first one.
  - When we use // in the middle of the XPath expression, it means "whatever tags in between". So //p//a matches an <a> tag anywhere inside a <p> tag, even when another parent tag such as <span> sits between them. However, //p/a matches only an <a> that is directly inside the <p> tag.
- *: matches any tag that satisfies the rule in []
- @: refers to an attribute, such as @class or @id
- .: starts from the current node. You use the dot when you are extracting from a wrapper selector, not from the response.
methods to get xpath
- google chrome: copy->copy xpath
- firefox: copy->copy xpath
- xpath helper (chrome extension)
- xpath tester
scrapy vs selenium
scrapy is faster than selenium at scraping, because it sends plain HTTP requests instead of driving a full browser.
selenium, however, is a browser automation tool built mainly for testing.