- Scheduler - Interface in us.codecraft.webmagic.scheduler
-
Scheduler is the component that manages URLs.
Implement the Scheduler interface to:
manage the URLs to fetch
remove duplicate URLs
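The two responsibilities listed above can be sketched with a queue plus a set of already-seen URLs. This is only a minimal illustration: the class `DedupQueueSketch` and its `push`/`poll` methods are hypothetical names chosen for the example and are not the actual signatures of the Scheduler interface.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in, NOT webmagic's Scheduler interface:
// it only illustrates "manage URLs to fetch" and "remove duplicate URLs".
class DedupQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Queue a URL unless it was already seen; returns true if it was queued.
    synchronized boolean push(String url) {
        if (!seen.add(url)) {
            return false; // duplicate, drop it
        }
        pending.add(url);
        return true;
    }

    // Return the next URL to fetch, or null when nothing is pending.
    synchronized String poll() {
        return pending.poll();
    }
}
```

URLs come back in insertion order here; a real implementation may order or persist them differently.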
- scheduler - Variable in class us.codecraft.webmagic.Spider
-
- scheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
-
Set the scheduler for the Spider.
- select(String) - Method in class us.codecraft.webmagic.selector.AndSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.CssSelector
-
- select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.Html
-
- select(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.OrSelector
-
- select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.PlainText
-
- select(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
-
- select(String) - Method in interface us.codecraft.webmagic.selector.Selector
-
Extract a single result from the text.
If there is more than one result, only the first is returned.
- select(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.XpathSelector
-
- Selectable - Interface in us.codecraft.webmagic.selector
-
Selectable text.
- selectGroup(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- selectGroupList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.AndSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.CssSelector
-
- selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.Html
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.OrSelector
-
- selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.PlainText
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
-
- selectList(String) - Method in interface us.codecraft.webmagic.selector.Selector
-
Extract all results from the text.
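The difference between select(String) and selectList(String) in the Selector interface can be illustrated with plain `java.util.regex`. The class `FirstVsAllSketch` below is a hypothetical example, not webmagic's RegexSelector, and returning null when nothing matches is an assumption of this sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration built on java.util.regex only.
class FirstVsAllSketch {
    private final Pattern pattern;

    FirstVsAllSketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // All matches, in order of appearance.
    List<String> selectList(String text) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    // First match only; null when nothing matches (an assumption of this sketch).
    String select(String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }
}
```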
- selectList(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.XpathSelector
-
- Selector - Interface in us.codecraft.webmagic.selector
-
Selector (extractor) for text.
- SelectorFactory - Class in us.codecraft.webmagic.selector
-
Selector factory with an internal cache.
- SelectorFactory() - Constructor for class us.codecraft.webmagic.selector.SelectorFactory
-
- setAcceptStatCode(Set<Integer>) - Method in class us.codecraft.webmagic.Site
-
Set acceptStatCode.
A response is processed only when its HTTP status code is in acceptStatCode.
The default is {200}.
Setting it is optional.
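The check described for setAcceptStatCode amounts to a set-membership test with a default of {200}. A minimal sketch follows; `AcceptStatusSketch`, `setAccepted`, and `shouldProcess` are hypothetical names and do not reproduce the Site class.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of an "accepted status codes" filter.
class AcceptStatusSketch {
    // Default mirrors the documented default of {200}.
    private Set<Integer> accepted = new HashSet<>(Collections.singleton(200));

    // Replace the accepted set; a defensive copy keeps later caller changes out.
    void setAccepted(Set<Integer> codes) {
        this.accepted = new HashSet<>(codes);
    }

    // A response is processed only if its status code is in the accepted set.
    boolean shouldProcess(int statusCode) {
        return accepted.contains(statusCode);
    }
}
```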
- setCharset(String) - Method in class us.codecraft.webmagic.Site
-
Set the charset of the page manually.
When the charset is not set or is set to null, it can be auto-detected from the HTTP header.
- setDomain(String) - Method in class us.codecraft.webmagic.Site
-
Set the domain of the site.
- setDownloader(Downloader) - Method in class us.codecraft.webmagic.Spider
-
Set the downloader of the spider.
- setDownloaderWhenFileMiss(Downloader) - Method in class us.codecraft.webmagic.downloader.FileCache
-
- setExtras(Map<String, Object>) - Method in class us.codecraft.webmagic.Request
-
- setHtml(Selectable) - Method in class us.codecraft.webmagic.Page
-
- setPath(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
-
- setPriority(double) - Method in class us.codecraft.webmagic.Request
-
Set the priority of the request for sorting.
This requires a scheduler that supports priority.
No scheduler in webmagic supports priority yet.
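For illustration only, a priority-aware queue could be built on `java.util.PriorityQueue`. Everything here is hypothetical: `PriorityQueueSketch` and its nested `Item` class are not webmagic types, and serving larger priority values first is an assumption, since the entry above does not say which direction the sort runs.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of a queue that orders requests by priority.
class PriorityQueueSketch {
    // Hypothetical request holder; not webmagic's Request class.
    static final class Item {
        final String url;
        final double priority;

        Item(String url, double priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    // Assumption: a larger priority value is served first.
    private final PriorityQueue<Item> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Item i) -> i.priority).reversed());

    synchronized void push(String url, double priority) {
        queue.add(new Item(url, priority));
    }

    // Next URL by priority, or null when the queue is empty.
    synchronized String poll() {
        Item next = queue.poll();
        return next == null ? null : next.url;
    }
}
```

The order among requests with equal priority is unspecified in this sketch.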
- setRequest(Request) - Method in class us.codecraft.webmagic.Page
-
- setRequest(Request) - Method in class us.codecraft.webmagic.ResultItems
-
- setRetryTimes(int) - Method in class us.codecraft.webmagic.Site
-
Set the number of retries when a download fails; 0 by default.
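One way to read a retry count is "one initial attempt plus up to N further attempts", so 0 means a single attempt. The helper below sketches that reading under this assumption; `RetrySketch.withRetries` is a hypothetical name and does not show how the downloader actually applies the setting.

```java
import java.util.function.Supplier;

// Hypothetical retry helper: one initial attempt, then up to retryTimes more.
class RetrySketch {
    static <T> T withRetries(Supplier<T> task, int retryTimes) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        throw last; // every attempt failed
    }
}
```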
- setScheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
-
Set the scheduler for the Spider.
- setSkip(boolean) - Method in class us.codecraft.webmagic.Page
-
- setSkip(boolean) - Method in class us.codecraft.webmagic.ResultItems
-
Set whether to skip the result.
A skipped result will not be processed by the Pipeline.
- setSleepTime(int) - Method in class us.codecraft.webmagic.Site
-
Set the interval between the processing of two pages.
The time unit is milliseconds.
- setThread(int) - Method in interface us.codecraft.webmagic.downloader.Downloader
-
Tell the downloader how many threads the spider uses.
- setThread(int) - Method in class us.codecraft.webmagic.downloader.FileCache
-
- setThread(int) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
-
- setUrl(Selectable) - Method in class us.codecraft.webmagic.Page
-
- setUrl(String) - Method in class us.codecraft.webmagic.Request
-
- setUserAgent(String) - Method in class us.codecraft.webmagic.Site
-
Set the user agent.
- setUUID(String) - Method in class us.codecraft.webmagic.Spider
-
Set a UUID for the spider.
The default UUID is the domain of the site.
- SimplePageProcessor - Class in us.codecraft.webmagic.processor
-
A simple PageProcessor.
- SimplePageProcessor(String, String) - Constructor for class us.codecraft.webmagic.processor.SimplePageProcessor
-
- Site - Class in us.codecraft.webmagic
-
Object containing the settings for a crawler.
- Site() - Constructor for class us.codecraft.webmagic.Site
-
- site - Variable in class us.codecraft.webmagic.Spider
-
- sleep(int) - Method in class us.codecraft.webmagic.Spider
-
- smartContent() - Method in class us.codecraft.webmagic.selector.Html
-
- smartContent() - Method in class us.codecraft.webmagic.selector.PlainText
-
- smartContent() - Method in interface us.codecraft.webmagic.selector.Selectable
-
Select smart content using the Readability algorithm.
- SmartContentSelector - Class in us.codecraft.webmagic.selector
-
Extract the text content of HTML.
Uses the Readability algorithm: find the parents of all p tags.
- SmartContentSelector() - Constructor for class us.codecraft.webmagic.selector.SmartContentSelector
-
- Spider - Class in us.codecraft.webmagic
-
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Each module is a field of Spider.
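The "four modules as fields" layout can be pictured with a small stand-alone sketch. It uses `java.util.function` types in place of the real interfaces, so `FourModuleSketch` and its members are hypothetical and say nothing about the actual signatures of Downloader, Scheduler, PageProcessor, or Pipeline.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical wiring of four modules as fields; these are not webmagic's types.
class FourModuleSketch {
    private final Queue<String> scheduler = new ArrayDeque<>(); // which URL comes next
    private final Function<String, String> downloader;          // URL -> page body
    private final Function<String, String> pageProcessor;       // page body -> extracted result
    private final Consumer<String> pipeline;                    // consumes each result

    FourModuleSketch(Function<String, String> downloader,
                     Function<String, String> pageProcessor,
                     Consumer<String> pipeline) {
        this.downloader = downloader;
        this.pageProcessor = pageProcessor;
        this.pipeline = pipeline;
    }

    void addUrl(String url) {
        scheduler.add(url);
    }

    // Single-threaded loop: schedule, download, process, then hand off to the pipeline.
    void run() {
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.apply(url);
            String result = pageProcessor.apply(body);
            pipeline.accept(result);
        }
    }
}
```

Keeping each module behind its own field means any one of them can be swapped without touching the loop.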
- Spider(PageProcessor) - Constructor for class us.codecraft.webmagic.Spider
-
Create a spider with a pageProcessor.
- startUrls - Variable in class us.codecraft.webmagic.Spider
-
- startUrls(List<String>) - Method in class us.codecraft.webmagic.Spider
-
Set the startUrls of the Spider.
Takes precedence over the startUrls of Site.
- stat - Variable in class us.codecraft.webmagic.Spider
-
- STAT_INIT - Static variable in class us.codecraft.webmagic.Spider
-
- STAT_RUNNING - Static variable in class us.codecraft.webmagic.Spider
-
- STAT_STOPPED - Static variable in class us.codecraft.webmagic.Spider
-
- strings - Variable in class us.codecraft.webmagic.selector.PlainText
-