$ A C D E F G H I J K L M N O P Q R S T U V X 

$

$(String) - Method in class us.codecraft.webmagic.selector.Html
 
$(String) - Method in class us.codecraft.webmagic.selector.PlainText
 
$(String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with css selector

A

addCookie(String, String) - Method in class us.codecraft.webmagic.Site
Add a cookie with domain Site.getDomain()
addPageModel(PageModelPipeline, Class...) - Method in class us.codecraft.webmagic.model.OOSpider
 
addPipeline(Pipeline) - Method in class us.codecraft.webmagic.Spider
add a pipeline for Spider
addRequest(Page) - Method in class us.codecraft.webmagic.Spider
 
addStartUrl(String) - Method in class us.codecraft.webmagic.Site
Add a url to the start urls.
addTargetRequest(String) - Method in class us.codecraft.webmagic.Page
add url to fetch
addTargetRequest(Request) - Method in class us.codecraft.webmagic.Page
add requests to fetch
addTargetRequests(List<String>) - Method in class us.codecraft.webmagic.Page
add urls to fetch
AfterExtractor - Interface in us.codecraft.webmagic.model
Interface to be implemented by page models that need to do something after fields are extracted.
afterProcess(Page) - Method in interface us.codecraft.webmagic.model.AfterExtractor
 
all() - Method in class us.codecraft.webmagic.selector.PlainText
 
all() - Method in interface us.codecraft.webmagic.selector.Selectable
multi string result
AndSelector - Class in us.codecraft.webmagic.selector
All selectors will be arranged as a pipeline.
AndSelector(Selector...) - Constructor for class us.codecraft.webmagic.selector.AndSelector
 
AndSelector(List<Selector>) - Constructor for class us.codecraft.webmagic.selector.AndSelector
 

C

canonicalizeUrl(String, String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
canonicalizeUrl
checkAndMakeParentDirecotry(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
checkComponent() - Method in class us.codecraft.webmagic.Spider
 
checkIfNotRunning() - Method in class us.codecraft.webmagic.Spider
 
clearPipeline() - Method in class us.codecraft.webmagic.Spider
clear the pipelines set
combine(MultiPageModel) - Method in interface us.codecraft.webmagic.MultiPageModel
Combine multiPageModels into a whole object.
ComboExtract - Annotation Type in us.codecraft.webmagic.model.annotation
Combo 'ExtractBy' extractor with and/or operator.
ComboExtract.Op - Enum in us.codecraft.webmagic.model.annotation
 
ComboExtract.Source - Enum in us.codecraft.webmagic.model.annotation
types of source for extracting.
ConsolePageModelPipeline - Class in us.codecraft.webmagic.model
Print page model in console.
Usually used in test.
ConsolePageModelPipeline() - Constructor for class us.codecraft.webmagic.model.ConsolePageModelPipeline
 
ConsolePipeline - Class in us.codecraft.webmagic.pipeline
Write results in console.
Usually used in test.
ConsolePipeline() - Constructor for class us.codecraft.webmagic.pipeline.ConsolePipeline
 
create(Site, Class...) - Static method in class us.codecraft.webmagic.model.OOSpider
 
create(Site, PageModelPipeline, Class...) - Static method in class us.codecraft.webmagic.model.OOSpider
 
create(String) - Static method in class us.codecraft.webmagic.selector.Html
 
create(String) - Static method in class us.codecraft.webmagic.selector.PlainText
 
create(PageProcessor) - Static method in class us.codecraft.webmagic.Spider
create a spider with pageProcessor.
CssSelector - Class in us.codecraft.webmagic.selector
CSS selector.
CssSelector(String) - Constructor for class us.codecraft.webmagic.selector.CssSelector
 

D

DEFAULT_CLAZZ - Static variable in class us.codecraft.webmagic.utils.MultiKeyMapBase
 
destroy() - Method in class us.codecraft.webmagic.Spider
 
DoubleKeyMap<K1,K2,V> - Class in us.codecraft.webmagic.utils
 
DoubleKeyMap() - Constructor for class us.codecraft.webmagic.utils.DoubleKeyMap
 
DoubleKeyMap(Map<K1, Map<K2, V>>) - Constructor for class us.codecraft.webmagic.utils.DoubleKeyMap
 
DoubleKeyMap(Class<? extends Map>) - Constructor for class us.codecraft.webmagic.utils.DoubleKeyMap
 
DoubleKeyMap(Map<K1, Map<K2, V>>, Class<? extends Map>) - Constructor for class us.codecraft.webmagic.utils.DoubleKeyMap
init map with protoMapClass
download(Request, Task) - Method in interface us.codecraft.webmagic.downloader.Downloader
Downloads web pages and stores them in a Page object.
download(Request, Task) - Method in class us.codecraft.webmagic.downloader.FileCache
 
download(String) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
A simple method to download a url.
download(Request, Task) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
Downloader - Interface in us.codecraft.webmagic.downloader
Downloader is the part that downloads web pages and stores them in a Page object.
downloader - Variable in class us.codecraft.webmagic.Spider
 
downloader(Downloader) - Method in class us.codecraft.webmagic.Spider
Deprecated. 

E

equals(Object) - Method in class us.codecraft.webmagic.Request
 
equals(Object) - Method in class us.codecraft.webmagic.Site
 
executorService - Variable in class us.codecraft.webmagic.Spider
 
Experimental - Annotation Type in us.codecraft.webmagic.utils
Marks features that are unstable.
ExtractBy - Annotation Type in us.codecraft.webmagic.model.annotation
Define the extractor for field or class.
ExtractBy.Source - Enum in us.codecraft.webmagic.model.annotation
types of source for extracting.
ExtractBy.Type - Enum in us.codecraft.webmagic.model.annotation
types of extractor expressions
ExtractByUrl - Annotation Type in us.codecraft.webmagic.model.annotation
Define an extractor for url.
ExtractorUtils - Class in us.codecraft.webmagic.utils
Tools for annotation converting.
ExtractorUtils() - Constructor for class us.codecraft.webmagic.utils.ExtractorUtils
 

F

FileCache - Class in us.codecraft.webmagic.downloader
Downloads a file and saves it to a file as a cache.
FileCache(String, String) - Constructor for class us.codecraft.webmagic.downloader.FileCache
 
FileCache(String, String, String) - Constructor for class us.codecraft.webmagic.downloader.FileCache
 
FileCacheQueueScheduler - Class in us.codecraft.webmagic.scheduler
Store urls and cursor in files so that a Spider can resume its status after a shutdown.
FileCacheQueueScheduler(String) - Constructor for class us.codecraft.webmagic.scheduler.FileCacheQueueScheduler
 
FilePersistentBase - Class in us.codecraft.webmagic.utils
Base object of file persistence.
FilePersistentBase() - Constructor for class us.codecraft.webmagic.utils.FilePersistentBase
 
FilePipeline - Class in us.codecraft.webmagic.pipeline
Store results in files.
FilePipeline() - Constructor for class us.codecraft.webmagic.pipeline.FilePipeline
create a FilePipeline with default path "/data/webmagic/"
FilePipeline(String) - Constructor for class us.codecraft.webmagic.pipeline.FilePipeline
 
fixAllRelativeHrefs(String, String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 

G

get(String) - Method in class us.codecraft.webmagic.ResultItems
 
get(K1) - Method in class us.codecraft.webmagic.utils.DoubleKeyMap
 
get(K1, K2) - Method in class us.codecraft.webmagic.utils.DoubleKeyMap
 
getAcceptStatCode() - Method in class us.codecraft.webmagic.Site
get acceptStatCode
getAll() - Method in class us.codecraft.webmagic.ResultItems
 
getCharset() - Method in class us.codecraft.webmagic.Site
get charset set manually
getCharset(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
getClient(Site) - Method in class us.codecraft.webmagic.downloader.HttpClientPool
 
getCookies() - Method in class us.codecraft.webmagic.Site
get cookies
getDomain() - Method in class us.codecraft.webmagic.Site
get domain
getDomain(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
getExtra(String) - Method in class us.codecraft.webmagic.Request
 
getExtras() - Method in class us.codecraft.webmagic.Request
 
getFile(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
getHost(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
getHtml() - Method in class us.codecraft.webmagic.Page
get html content of page
getInstance(int) - Static method in class us.codecraft.webmagic.downloader.HttpClientPool
 
getInstatnce() - Static method in class us.codecraft.webmagic.selector.SelectorFactory
 
getOtherPages() - Method in interface us.codecraft.webmagic.MultiPageModel
other pages to be extracted.
It is used to judge whether an object contains more than one page, and whether the pages of the object are all extracted.
getPage() - Method in interface us.codecraft.webmagic.MultiPageModel
page is the identifier of a page in pages for one object.
getPageKey() - Method in interface us.codecraft.webmagic.MultiPageModel
Page key is the identifier for the object.
getPath() - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
getPriority() - Method in class us.codecraft.webmagic.Request
 
getRequest() - Method in class us.codecraft.webmagic.Page
get request of current page
getRequest() - Method in class us.codecraft.webmagic.ResultItems
 
getResultItems() - Method in class us.codecraft.webmagic.Page
 
getRetryTimes() - Method in class us.codecraft.webmagic.Site
Get retry times when download fails, 0 by default.
getSelector(ExtractBy) - Static method in class us.codecraft.webmagic.utils.ExtractorUtils
 
getSelectors(ExtractBy[]) - Static method in class us.codecraft.webmagic.utils.ExtractorUtils
 
getSite() - Method in class us.codecraft.webmagic.downloader.FileCache
 
getSite() - Method in interface us.codecraft.webmagic.processor.PageProcessor
get the site settings
getSite() - Method in class us.codecraft.webmagic.processor.SimplePageProcessor
 
getSite() - Method in class us.codecraft.webmagic.Spider
 
getSite() - Method in interface us.codecraft.webmagic.Task
site of a task
getSleepTime() - Method in class us.codecraft.webmagic.Site
Get the interval between the processing of two pages.
Time unit is milliseconds.
getStartUrls() - Method in class us.codecraft.webmagic.Site
get start urls
getTargetRequests() - Method in class us.codecraft.webmagic.Page
 
getUrl() - Method in class us.codecraft.webmagic.Page
get url of current page
getUrl() - Method in class us.codecraft.webmagic.Request
 
getUserAgent() - Method in class us.codecraft.webmagic.Site
get user agent
getUUID() - Method in class us.codecraft.webmagic.Spider
 
getUUID() - Method in interface us.codecraft.webmagic.Task
unique id for a task.

H

handleResponse(Request, String, HttpResponse, Task) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
hashCode() - Method in class us.codecraft.webmagic.Request
 
hashCode() - Method in class us.codecraft.webmagic.Site
 
HasKey - Interface in us.codecraft.webmagic.model
Interface to be implemented by page models.
Can be used to identify a page model, or be used as name of file storing the object.
HelpUrl - Annotation Type in us.codecraft.webmagic.model.annotation
Define the 'help' url patterns for class.
Html - Class in us.codecraft.webmagic.selector
Selectable html.
Html(List<String>) - Constructor for class us.codecraft.webmagic.selector.Html
 
Html(String) - Constructor for class us.codecraft.webmagic.selector.Html
 
HttpClientDownloader - Class in us.codecraft.webmagic.downloader
The http downloader based on HttpClient.
HttpClientDownloader() - Constructor for class us.codecraft.webmagic.downloader.HttpClientDownloader
 
HttpClientPool - Class in us.codecraft.webmagic.downloader
 

I

INSTANCE - Static variable in class us.codecraft.webmagic.downloader.HttpClientPool
 
isSkip() - Method in class us.codecraft.webmagic.ResultItems
Whether to skip the result.
Result which is skipped will not be processed by Pipeline.

J

JsonFilePageModelPipeline - Class in us.codecraft.webmagic.pipeline
Store results objects (page models) to files in JSON format.
Use model.getKey() as file name if the model implements HasKey.
Otherwise use SHA1 as file name.
JsonFilePageModelPipeline() - Constructor for class us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline
new JsonFilePageModelPipeline with default path "/data/webmagic/"
JsonFilePageModelPipeline(String) - Constructor for class us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline
 
JsonFilePipeline - Class in us.codecraft.webmagic.pipeline
Store results to files in JSON format.
JsonFilePipeline() - Constructor for class us.codecraft.webmagic.pipeline.JsonFilePipeline
new JsonFilePipeline with default path "/data/webmagic/"
JsonFilePipeline(String) - Constructor for class us.codecraft.webmagic.pipeline.JsonFilePipeline
 
JsonPathSelector - Class in us.codecraft.webmagic.selector
JsonPath selector.
Used to extract content from JSON.
JsonPathSelector(String) - Constructor for class us.codecraft.webmagic.selector.JsonPathSelector
 

K

key() - Method in interface us.codecraft.webmagic.model.HasKey
 

L

links() - Method in class us.codecraft.webmagic.selector.Html
 
links() - Method in class us.codecraft.webmagic.selector.PlainText
 
links() - Method in interface us.codecraft.webmagic.selector.Selectable
select all links
logger - Variable in class us.codecraft.webmagic.Spider
 

M

me() - Static method in class us.codecraft.webmagic.Site
create a new Site
MultiKeyMapBase - Class in us.codecraft.webmagic.utils
multi-key map, some basic objects
MultiKeyMapBase() - Constructor for class us.codecraft.webmagic.utils.MultiKeyMapBase
 
MultiKeyMapBase(Class<? extends Map>) - Constructor for class us.codecraft.webmagic.utils.MultiKeyMapBase
 
MultiPageModel - Interface in us.codecraft.webmagic
Extract an object that spans more than one page, such as news and articles.
MultiPagePipeline - Class in us.codecraft.webmagic.pipeline
A pipeline that combines the results from more than one page.
Used for news and articles containing more than one web page.
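The combining step can be pictured with a small, library-free sketch. It is not the MultiPagePipeline implementation: the class name, the `offer` method, and the use of plain strings are hypothetical. It assumes that every part reports its object key (as in getPageKey()), its own page id (as in getPage()), and the ids of all pages of the object (as in getOtherPages()), and that parts are joined in the lexicographic order of their page ids.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of multi-page combining; not the webmagic class. */
class MultiPageCombinerSketch {
    // pageKey -> (page id -> extracted text); TreeMap keeps page ids sorted as strings
    private final Map<String, Map<String, String>> parts = new HashMap<>();

    /**
     * Records one page of an object. Returns the combined text once every id
     * in otherPages has been recorded, otherwise returns null.
     */
    String offer(String pageKey, String page, List<String> otherPages, String text) {
        Map<String, String> byPage = parts.computeIfAbsent(pageKey, k -> new TreeMap<>());
        byPage.put(page, text);
        if (!byPage.keySet().containsAll(otherPages)) {
            return null; // still waiting for at least one other page
        }
        parts.remove(pageKey);
        // Assumption: lexicographic page order ("10" sorts before "2").
        return String.join("", byPage.values());
    }
}
```

With two pages "1" and "2" of the same object, the first call to `offer` returns null and the second returns the joined text, whichever page arrives first.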
MultiPagePipeline() - Constructor for class us.codecraft.webmagic.pipeline.MultiPagePipeline
 

N

newAndCacheSelector(Class<T>, String...) - Method in class us.codecraft.webmagic.selector.SelectorFactory
 
newFixedThreadPool(int) - Static method in class us.codecraft.webmagic.utils.ThreadUtils
 
newMap() - Method in class us.codecraft.webmagic.utils.MultiKeyMapBase
 
newRegexSelector(String) - Method in class us.codecraft.webmagic.selector.SelectorFactory
 
newReplaceSelector(String, String) - Method in class us.codecraft.webmagic.selector.SelectorFactory
 
newSelector(Class<T>, String...) - Method in class us.codecraft.webmagic.selector.SelectorFactory
 
newSmartContentSelector() - Method in class us.codecraft.webmagic.selector.SelectorFactory
 
newXpathSelector(String) - Method in class us.codecraft.webmagic.selector.SelectorFactory
 

O

OOSpider - Class in us.codecraft.webmagic.model
The spider for page model extractor.
In webmagic, a POJO containing extracted results is called a "page model".
OOSpider(ModelPageProcessor) - Constructor for class us.codecraft.webmagic.model.OOSpider
 
OOSpider(PageProcessor) - Constructor for class us.codecraft.webmagic.model.OOSpider
 
OOSpider(Site, PageModelPipeline, Class...) - Constructor for class us.codecraft.webmagic.model.OOSpider
create a spider
OrSelector - Class in us.codecraft.webmagic.selector
All extractors will do extracting separately,
and the results of the extractors will be combined as the final result.
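The difference between the "and" (pipeline) and "or" (combined results) arrangements can be shown with a minimal sketch built on the JDK only. The class and method names are hypothetical, a selector is modelled as a `Function<String, List<String>>`, and the regex helper that collects group 1 of each match is an assumption made for the example; none of this is the library's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of selector combinators; not the webmagic classes. */
class SelectorCombinatorsSketch {
    /** "And": every result of one stage becomes the input of the next stage. */
    @SafeVarargs
    static Function<String, List<String>> and(Function<String, List<String>>... stages) {
        return text -> {
            List<String> current = new ArrayList<>();
            current.add(text);
            for (Function<String, List<String>> stage : stages) {
                List<String> next = new ArrayList<>();
                for (String s : current) {
                    next.addAll(stage.apply(s));
                }
                current = next;
            }
            return current;
        };
    }

    /** "Or": every selector runs on the same input and the results are concatenated. */
    @SafeVarargs
    static Function<String, List<String>> or(Function<String, List<String>>... selectors) {
        return text -> {
            List<String> all = new ArrayList<>();
            for (Function<String, List<String>> selector : selectors) {
                all.addAll(selector.apply(text));
            }
            return all;
        };
    }

    /** Assumed helper: a regex selector that collects group 1 of every match. */
    static Function<String, List<String>> regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return text -> {
            List<String> out = new ArrayList<>();
            Matcher m = compiled.matcher(text);
            while (m.find()) {
                out.add(m.group(1));
            }
            return out;
        };
    }
}
```

Chaining narrows the input step by step, while the "or" form widens the result set by merging what each selector finds on the original text.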
OrSelector(Selector...) - Constructor for class us.codecraft.webmagic.selector.OrSelector
 
OrSelector(List<Selector>) - Constructor for class us.codecraft.webmagic.selector.OrSelector
 

P

Page - Class in us.codecraft.webmagic
Object storing extracted result and urls to fetch.
Main methods:
Page.getUrl() get url of current page
Page.getHtml() get content of current page
Page.putField(String, Object) save extracted result
Page.getResultItems() get extract results to be used in Pipeline
Page.addTargetRequests(java.util.List) Page.addTargetRequest(String) add urls to fetch
Page() - Constructor for class us.codecraft.webmagic.Page
 
PageModelPipeline<T> - Interface in us.codecraft.webmagic.model
Implement PageModelPipeline to persist your page model.
PageProcessor - Interface in us.codecraft.webmagic.processor
Interface to be implemented to customize a crawler.

In PageProcessor, you can customize:

start urls and other settings in Site
how the urls to fetch are detected
how the data are extracted and stored

pageProcessor - Variable in class us.codecraft.webmagic.Spider
 
path - Variable in class us.codecraft.webmagic.utils.FilePersistentBase
 
PATH_SEPERATOR - Static variable in class us.codecraft.webmagic.utils.FilePersistentBase
 
Pipeline - Interface in us.codecraft.webmagic.pipeline
Pipeline is the persistence and offline processing part of the crawler.
The interface Pipeline can be implemented to customize how results are persisted.
pipeline(Pipeline) - Method in class us.codecraft.webmagic.Spider
Deprecated. 
pipelines - Variable in class us.codecraft.webmagic.Spider
 
PlainText - Class in us.codecraft.webmagic.selector
Selectable plain text.
Cannot be selected by XPath or CSS Selector.
PlainText(List<String>) - Constructor for class us.codecraft.webmagic.selector.PlainText
 
PlainText(String) - Constructor for class us.codecraft.webmagic.selector.PlainText
 
poll(Task) - Method in class us.codecraft.webmagic.scheduler.FileCacheQueueScheduler
 
poll(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
 
poll(Task) - Method in class us.codecraft.webmagic.scheduler.RedisScheduler
 
poll(Task) - Method in interface us.codecraft.webmagic.scheduler.Scheduler
return the next url to fetch
process(ResultItems, Task) - Method in class us.codecraft.webmagic.downloader.FileCache
 
process(Page) - Method in class us.codecraft.webmagic.downloader.FileCache
 
process(Object, Task) - Method in class us.codecraft.webmagic.model.ConsolePageModelPipeline
 
process(T, Task) - Method in interface us.codecraft.webmagic.model.PageModelPipeline
 
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.ConsolePipeline
 
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.FilePipeline
 
process(Object, Task) - Method in class us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline
 
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.JsonFilePipeline
 
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.MultiPagePipeline
 
process(ResultItems, Task) - Method in interface us.codecraft.webmagic.pipeline.Pipeline
Process extracted results.
process(Page) - Method in interface us.codecraft.webmagic.processor.PageProcessor
process the page, extract urls to fetch, extract the data and store
process(Page) - Method in class us.codecraft.webmagic.processor.SimplePageProcessor
 
processRequest(Request) - Method in class us.codecraft.webmagic.Spider
 
push(Request, Task) - Method in class us.codecraft.webmagic.scheduler.FileCacheQueueScheduler
 
push(Request, Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
 
push(Request, Task) - Method in class us.codecraft.webmagic.scheduler.RedisScheduler
 
push(Request, Task) - Method in interface us.codecraft.webmagic.scheduler.Scheduler
add a url to fetch
put(String, T) - Method in class us.codecraft.webmagic.ResultItems
 
put(K1, Map<K2, V>) - Method in class us.codecraft.webmagic.utils.DoubleKeyMap
 
put(K1, K2, V) - Method in class us.codecraft.webmagic.utils.DoubleKeyMap
 
putExtra(String, Object) - Method in class us.codecraft.webmagic.Request
 
putField(String, Object) - Method in class us.codecraft.webmagic.Page
store extract results

Q

QueueScheduler - Class in us.codecraft.webmagic.scheduler
Basic Scheduler implementation.
Store urls to fetch in LinkedBlockingQueue and remove duplicate urls by HashMap.
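The queue-plus-deduplication idea described above can be sketched with JDK collections alone. This is an illustration, not the QueueScheduler source: the class name and method shapes are hypothetical, urls are plain strings rather than Request objects, and a map-backed concurrent set stands in for the HashMap mentioned in the description.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a de-duplicating url queue; not the webmagic class. */
class DedupQueueSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Set view backed by a ConcurrentHashMap; remembers every url ever pushed
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Enqueues the url only the first time it is seen; returns true if it was added. */
    boolean push(String url) {
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next url to fetch, or null when the queue is empty. */
    String poll() {
        return queue.poll();
    }
}
```

Because the set is never cleared, a url that has already been polled is still rejected when pushed again, which is what keeps a crawl from revisiting pages.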
QueueScheduler() - Constructor for class us.codecraft.webmagic.scheduler.QueueScheduler
 

R

RedisScheduler - Class in us.codecraft.webmagic.scheduler
Use Redis as url scheduler for distributed crawlers.
RedisScheduler(String) - Constructor for class us.codecraft.webmagic.scheduler.RedisScheduler
 
RedisScheduler(JedisPool) - Constructor for class us.codecraft.webmagic.scheduler.RedisScheduler
 
regex(String) - Method in class us.codecraft.webmagic.selector.PlainText
 
regex(String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with regex
RegexSelector - Class in us.codecraft.webmagic.selector
Selector in regex.
RegexSelector(String) - Constructor for class us.codecraft.webmagic.selector.RegexSelector
 
remove(K1, K2) - Method in class us.codecraft.webmagic.utils.DoubleKeyMap
 
remove(K1) - Method in class us.codecraft.webmagic.utils.DoubleKeyMap
 
removeProtocol(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
replace(String, String) - Method in class us.codecraft.webmagic.selector.PlainText
 
replace(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
replace with regex
ReplaceSelector - Class in us.codecraft.webmagic.selector
Replace selector.
ReplaceSelector(String, String) - Constructor for class us.codecraft.webmagic.selector.ReplaceSelector
 
Request - Class in us.codecraft.webmagic
Object containing the url to crawl.
It contains some additional information.
Request() - Constructor for class us.codecraft.webmagic.Request
 
Request(String) - Constructor for class us.codecraft.webmagic.Request
 
ResultItems - Class in us.codecraft.webmagic
Object containing the extracted results.
It is contained in Page and will be processed in pipeline.
ResultItems() - Constructor for class us.codecraft.webmagic.ResultItems
 
reversePath(String, int) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
run() - Method in class us.codecraft.webmagic.Spider
 
runAsync() - Method in class us.codecraft.webmagic.Spider
 

S

Scheduler - Interface in us.codecraft.webmagic.scheduler
Scheduler is the part of url management.
You can implement interface Scheduler to manage urls to fetch and to remove duplicate urls.
scheduler - Variable in class us.codecraft.webmagic.Spider
 
scheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
set scheduler for Spider
select(String) - Method in class us.codecraft.webmagic.selector.AndSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.CssSelector
 
select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.Html
 
select(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.OrSelector
 
select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.PlainText
 
select(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
 
select(String) - Method in interface us.codecraft.webmagic.selector.Selector
Extract single result in text.
If there are more than one result, only the first will be chosen.
select(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.XpathSelector
 
Selectable - Interface in us.codecraft.webmagic.selector
Selectable text.
selectGroup(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
selectGroupList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.AndSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.CssSelector
 
selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.Html
 
selectList(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.OrSelector
 
selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.PlainText
 
selectList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
 
selectList(String) - Method in interface us.codecraft.webmagic.selector.Selector
Extract all results in text.
selectList(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.XpathSelector
 
Selector - Interface in us.codecraft.webmagic.selector
Selector(extractor) for text.
SelectorFactory - Class in us.codecraft.webmagic.selector
Selector factory with some inner cache.
SelectorFactory() - Constructor for class us.codecraft.webmagic.selector.SelectorFactory
 
setAcceptStatCode(Set<Integer>) - Method in class us.codecraft.webmagic.Site
Set acceptStatCode.
When status code of http response is in acceptStatCodes, it will be processed.
{200} by default.
It does not need to be set.
setCharset(String) - Method in class us.codecraft.webmagic.Site
Set charset of page manually.
When charset is not set or set to null, it can be auto detected by Http header.
setDomain(String) - Method in class us.codecraft.webmagic.Site
set the domain of site.
setDownloader(Downloader) - Method in class us.codecraft.webmagic.Spider
set the downloader of spider
setDownloaderWhenFileMiss(Downloader) - Method in class us.codecraft.webmagic.downloader.FileCache
 
setExtras(Map<String, Object>) - Method in class us.codecraft.webmagic.Request
 
setHtml(Selectable) - Method in class us.codecraft.webmagic.Page
 
setPath(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
setPriority(double) - Method in class us.codecraft.webmagic.Request
Set the priority of request for sorting.
Need a scheduler supporting priority.
However, no scheduler in webmagic supports priority yet.
setRequest(Request) - Method in class us.codecraft.webmagic.Page
 
setRequest(Request) - Method in class us.codecraft.webmagic.ResultItems
 
setRetryTimes(int) - Method in class us.codecraft.webmagic.Site
Set retry times when download fails, 0 by default.
setScheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
set scheduler for Spider
setSkip(boolean) - Method in class us.codecraft.webmagic.Page
 
setSkip(boolean) - Method in class us.codecraft.webmagic.ResultItems
Set whether to skip the result.
Result which is skipped will not be processed by Pipeline.
setSleepTime(int) - Method in class us.codecraft.webmagic.Site
Set the interval between the processing of two pages.
Time unit is milliseconds.
setThread(int) - Method in interface us.codecraft.webmagic.downloader.Downloader
Tell the downloader how many threads the spider uses.
setThread(int) - Method in class us.codecraft.webmagic.downloader.FileCache
 
setThread(int) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
setUrl(Selectable) - Method in class us.codecraft.webmagic.Page
 
setUrl(String) - Method in class us.codecraft.webmagic.Request
 
setUserAgent(String) - Method in class us.codecraft.webmagic.Site
set user agent
setUUID(String) - Method in class us.codecraft.webmagic.Spider
Set a uuid for the spider.
Default uuid is domain of site.
SimplePageProcessor - Class in us.codecraft.webmagic.processor
A simple PageProcessor.
SimplePageProcessor(String, String) - Constructor for class us.codecraft.webmagic.processor.SimplePageProcessor
 
Site - Class in us.codecraft.webmagic
Object containing the settings for a crawler.
Site() - Constructor for class us.codecraft.webmagic.Site
 
site - Variable in class us.codecraft.webmagic.Spider
 
sleep(int) - Method in class us.codecraft.webmagic.Spider
 
smartContent() - Method in class us.codecraft.webmagic.selector.Html
 
smartContent() - Method in class us.codecraft.webmagic.selector.PlainText
 
smartContent() - Method in interface us.codecraft.webmagic.selector.Selectable
select smart content with Readability algorithm
SmartContentSelector - Class in us.codecraft.webmagic.selector
Extract the text content of html.
Using Readability algorithm: find parents of all p tags.
SmartContentSelector() - Constructor for class us.codecraft.webmagic.selector.SmartContentSelector
 
Spider - Class in us.codecraft.webmagic
Entrance of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
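How the four modules cooperate can be sketched as a single loop. The code below is a simplified, single-threaded illustration using only the JDK; it is not the Spider implementation. All names (`CrawlLoopSketch`, `PageStub`, `Processor`, `Sink`, `crawl`) are hypothetical, the downloader is modelled as a function from url to content, and the scheduler is reduced to a queue plus a set of seen urls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a crawl loop; not the webmagic Spider. */
class CrawlLoopSketch {
    /** Stand-in for a page: url, raw content, discovered urls, extracted fields. */
    static class PageStub {
        final String url;
        final String content;
        final List<String> targets = new ArrayList<>();
        final Map<String, Object> fields = new HashMap<>();

        PageStub(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    interface Processor { void process(PageStub page); }

    interface Sink { void accept(Map<String, Object> fields); }

    /** Runs until no urls are left; returns the number of pages processed. */
    static int crawl(String startUrl, Function<String, String> downloader,
                     Processor processor, Sink pipeline) {
        Queue<String> queue = new ArrayDeque<>(); // scheduler: urls to fetch
        Set<String> seen = new HashSet<>();       // scheduler: duplicate removal
        queue.add(startUrl);
        seen.add(startUrl);
        int processed = 0;
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String content = downloader.apply(url); // downloader
            if (content == null) {
                continue; // nothing downloaded, skip this url
            }
            PageStub page = new PageStub(url, content);
            processor.process(page);                // page processor: extract data and links
            pipeline.accept(page.fields);           // pipeline: persist or print results
            for (String target : page.targets) {
                if (seen.add(target)) {
                    queue.add(target);
                }
            }
            processed++;
        }
        return processed;
    }
}
```

Keeping each module behind its own small interface is what allows one part, such as the scheduler or the pipeline, to be swapped without touching the rest of the loop.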
Spider(PageProcessor) - Constructor for class us.codecraft.webmagic.Spider
create a spider with pageProcessor.
startUrls - Variable in class us.codecraft.webmagic.Spider
 
startUrls(List<String>) - Method in class us.codecraft.webmagic.Spider
Set startUrls of Spider.
Takes priority over startUrls of Site.
stat - Variable in class us.codecraft.webmagic.Spider
 
STAT_INIT - Static variable in class us.codecraft.webmagic.Spider
 
STAT_RUNNING - Static variable in class us.codecraft.webmagic.Spider
 
STAT_STOPPED - Static variable in class us.codecraft.webmagic.Spider
 
strings - Variable in class us.codecraft.webmagic.selector.PlainText
 

T

TargetUrl - Annotation Type in us.codecraft.webmagic.model.annotation
Define the url patterns for class.
Task - Interface in us.codecraft.webmagic
Interface for identifying different tasks.
test(String...) - Method in class us.codecraft.webmagic.Spider
Process specific urls without url discovering.
thread(int) - Method in class us.codecraft.webmagic.Spider
start with more than one thread
threadNum - Variable in class us.codecraft.webmagic.Spider
 
ThreadUtils - Class in us.codecraft.webmagic.utils
 
ThreadUtils() - Constructor for class us.codecraft.webmagic.utils.ThreadUtils
 
toString() - Method in class us.codecraft.webmagic.Page
 
toString() - Method in class us.codecraft.webmagic.Request
 
toString() - Method in class us.codecraft.webmagic.selector.PlainText
 
toString() - Method in class us.codecraft.webmagic.selector.RegexSelector
 
toString() - Method in class us.codecraft.webmagic.selector.ReplaceSelector
 
toString() - Method in interface us.codecraft.webmagic.selector.Selectable
single string result
toTask() - Method in class us.codecraft.webmagic.Site
 

U

UrlUtils - Class in us.codecraft.webmagic.utils
url and html utils.
UrlUtils() - Constructor for class us.codecraft.webmagic.utils.UrlUtils
 
us.codecraft.webmagic - package us.codecraft.webmagic
Main class "Spider" and models.
us.codecraft.webmagic.downloader - package us.codecraft.webmagic.downloader
Downloader is the part that downloads web pages and stores them in a Page object.
us.codecraft.webmagic.model - package us.codecraft.webmagic.model
Page model and annotations used to customize a crawler.
us.codecraft.webmagic.model.annotation - package us.codecraft.webmagic.model.annotation
Annotations for defining an extractor.
us.codecraft.webmagic.pipeline - package us.codecraft.webmagic.pipeline
Pipeline is the persistence and offline processing part of the crawler.
us.codecraft.webmagic.processor - package us.codecraft.webmagic.processor
PageProcessor is the customizable part of a crawler for a specific site.
us.codecraft.webmagic.scheduler - package us.codecraft.webmagic.scheduler
Scheduler is the part of url management.
us.codecraft.webmagic.selector - package us.codecraft.webmagic.selector
Selectors for page extraction.
us.codecraft.webmagic.utils - package us.codecraft.webmagic.utils
Static utils of webmagic.
uuid - Variable in class us.codecraft.webmagic.Spider
 

V

valueOf(String) - Static method in enum us.codecraft.webmagic.model.annotation.ComboExtract.Op
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum us.codecraft.webmagic.model.annotation.ComboExtract.Source
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum us.codecraft.webmagic.model.annotation.ExtractBy.Source
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum us.codecraft.webmagic.model.annotation.ExtractBy.Type
Returns the enum constant of this type with the specified name.
values() - Static method in enum us.codecraft.webmagic.model.annotation.ComboExtract.Op
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum us.codecraft.webmagic.model.annotation.ComboExtract.Source
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum us.codecraft.webmagic.model.annotation.ExtractBy.Source
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum us.codecraft.webmagic.model.annotation.ExtractBy.Type
Returns an array containing the constants of this enum type, in the order they are declared.

X

xpath(String) - Method in class us.codecraft.webmagic.selector.Html
 
xpath(String) - Method in class us.codecraft.webmagic.selector.PlainText
 
xpath(String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with xpath
XpathSelector - Class in us.codecraft.webmagic.selector
XPath selector based on HtmlCleaner.
XpathSelector(String) - Constructor for class us.codecraft.webmagic.selector.XpathSelector
 