A question about crontab

johnnychiahuahs

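Hello everyone. My intent was for CL_daily_productList_P1, in each cycle, to extract the clues for the first 42 product detail pages from page one of the Tmall product search results and pass them to CL_daily_dealRecord1_p1. However, I found that the clues passed to CL_daily_dealRecord1_p1 are not the first 42 product detail clues from the search page every time. Instead, they are the 42 clues that come right after the ones crawled in the previous round. I looked up the description of the crawl variables but still can't make sense of it. Which variable do I need to change? My crontab is attached below: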

<?xml version="1.0" encoding="UTF-8"?>
<crontab>
<thread name="TmallCL">
   <parameter>
   <auto>true</auto>
   <start>5</start>
   <period>3600</period>
   <waitOnload>false</waitOnload>
   <minIdle>1</minIdle>
   <maxIdle>2</maxIdle>
   </parameter>

   <step name="renewClue">
   <theme>CL_daily_productList_P1</theme>
   </step>

   <step name="crawl">
   <theme>CL_daily_productList_P1</theme>
   <loadTimeout>15000</loadTimeout>
   <lazyCycle>3</lazyCycle>
   <updateClue>true</updateClue>
   <dupRatio>80</dupRatio>
   <timerTriggered>false</timerTriggered>
   <depth>1</depth>
   <width>1</width>
   <renew>true</renew>
   <scrollWindowRatio>-1</scrollWindowRatio>
   <scrollMorePages>0</scrollMorePages>
   <stopOnDupCont>true</stopOnDupCont>
   <allowPlugin>false</allowPlugin>
   <allowImage>false</allowImage>
   <allowJavascript>true</allowJavascript>
   <resumePageLoad>true</resumePageLoad>
   <resumeMaxCount>5</resumeMaxCount>
   </step>
   <step name="crawl">
   <theme>CL_daily_dealRecord1_p1</theme>
   <loadTimeout>15000</loadTimeout>
   <lazyCycle>3</lazyCycle>
   <updateClue>false</updateClue>
   <dupRatio>60</dupRatio>
   <timerTriggered>true</timerTriggered>
   <depth>30</depth>
   <width>42</width>
   <renew>true</renew>
   <scrollWindowRatio>-1</scrollWindowRatio>
   <scrollMorePages>5</scrollMorePages>
   <stopOnDupCont>true</stopOnDupCont>
   <allowPlugin>false</allowPlugin>
   <allowImage>false</allowImage>
   <allowJavascript>true</allowJavascript>
   <resumePageLoad>true</resumePageLoad>
   <resumeMaxCount>3</resumeMaxCount>
   </step>
</thread>
</crontab>

ym · posted 2016-1-6 10:28:26

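The first-level rule generates clues for the second-level rule in order of crawl time, and the second-level rule crawls them in that same clue order.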

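Your first-level depth is 1, which means one page turn, so two pages of data are crawled and two pages of product clues are generated for the second level (that is certainly more than 42 clues). When the second level runs, your width is 42, which means it crawls the first 42 product clues. On the next run, counting by clue generation time, more than 42 clues already exist from before, so crawling picks up from the 43rd clue.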

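If you only want the product clues from the first page, set the first-level rule's depth to 0 so that only the first page's product clues are collected, and set the second level's width to -1 so that every product from the first page gets crawled. You can delete the previously generated clues under Personal Center > Crawler Management > Rule Management, and then start over.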
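
As a rough sketch of that change, the fragment below shows only the two crawl steps from the crontab in the question, with the two values mentioned above altered (depth 0 in the first step, width -1 in the second). All other elements are assumed to stay exactly as posted and are elided with comments here. The first step's width and the second step's depth are copied unchanged from the original.

```xml
<step name="crawl">
   <theme>CL_daily_productList_P1</theme>
   <!-- other elements unchanged from the original crontab -->
   <depth>0</depth>   <!-- was 1: no page turn, collect clues from page one only -->
   <width>1</width>
</step>
<step name="crawl">
   <theme>CL_daily_dealRecord1_p1</theme>
   <!-- other elements unchanged from the original crontab -->
   <depth>30</depth>
   <width>-1</width>  <!-- was 42: crawl every pending clue instead of a fixed 42 -->
</step>
```

Remember that the stale clues from earlier runs still need to be deleted first, otherwise the second step would start with them.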

1 reply to this post in total; last reply on 2016-1-6 10:28
