A concurrent web crawler to crawl the web.

- Depth Limited Crawling
- User-specified valid protocols
- User-buildable adapters that the crawler feeds output to
- Filter Duplicates (Default, Non-Customizable)
- Filter URLs that fail a HEAD request (Default, Non-Customizable)
- User-specifiable max timeout between two successive URL requests
- Max number of links to be crawled
```go
func main() {
	crawler.SetupSystem()
	crawler.BeginCrawling("https://www.example.com")
}
```
### List of customizations

Customizations can be made by supplying the crawler with an instance of `CrawlOptions`. The basic structure is shown below, with a brief explanation of each option.
```go
type CrawlOptions struct {
	MaxCrawlDepth      int64         // Max depth of the crawl; 0 is the initial link.
	MaxCrawledUrls     int64         // Max number of links to be crawled in total.
	StayWithinBaseHost bool          // [Not-Implemented-Yet]
	CrawlRate          int64         // Max rate at which requests can be made (req/sec).
	CrawlBurstLimit    int64         // Max burst capacity (should be at least the crawl rate).
	RespectRobots      bool          // [Not-Implemented-Yet]
	IncludeBody        bool          // Include the request body (contents of the web page) in the crawl result.
	OpAdapter          OutputAdapter // A user-defined crawl output handler (see next section for info).
	ValidProtocols     []string      // Valid protocols to crawl (http, https, ftp, etc.)
	TimeToQuit         int64         // Timeout (seconds) between two attempts or requests, before the crawler quits.
}
```
A default instance of the `CrawlOptions` can be obtained by calling `octopus.GetDefaultCrawlOptions()`. This can be further customized by overriding individual properties.
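For example, a minimal sketch of overriding a few of the defaults (the field values here are purely illustrative, and `octopus` is assumed to be the imported package name):

```go
opts := octopus.GetDefaultCrawlOptions()
opts.MaxCrawlDepth = 2                          // follow links up to two hops from the seed URL
opts.MaxCrawledUrls = 500                       // stop after 500 links in total
opts.ValidProtocols = []string{"http", "https"} // ignore every other protocol
```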
### Output Adapters

An Output Adapter is the final destination of a crawler-processed request. The output of the crawler is fed here, according to the customizations made through the `CrawlOptions` attached to the crawler before it is started.

The `OutputAdapter` is a Go interface that has to be implemented by your (user-defined) processor.
```go
type OutputAdapter interface {
	Consume() *NodeChSet
}
```
The user has to implement the `Consume()` method, which returns a __*pointer*__ to a `NodeChSet` (described below). The crawler uses the returned channel to send the crawl output, and the user can then start listening for output from the crawler.

**Note**: If you implement a custom `OutputAdapter`, **remember** to listen for the output on another goroutine; otherwise you might block the crawler from running. At the very least, begin the crawling on another goroutine before you begin processing output.
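A minimal sketch of that second option, reusing the names from the quick-start example above (`consumeOutput` is a hypothetical stand-in for whatever drains your adapter's channel):

```go
// Start the crawl on its own goroutine so that output processing on the
// current goroutine cannot block the crawler.
go crawler.BeginCrawling("https://www.example.com")
consumeOutput() // hypothetical: your code that reads from the adapter's channel
```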
The structure of the `NodeChSet` is given below.
```go
type NodeChSet struct {
	NodeCh chan<- *Node
	*StdChannels
}

type StdChannels struct {
	QuitCh chan<- int
}

type Node struct {
	*NodeInfo
	Body io.ReadCloser
}

type NodeInfo struct {
	ParentUrlString string
	UrlString       string
	Depth           int64
}
```
You can use the utility function `MakeDefaultNodeChSet()` to get a `NodeChSet` built for you. This also returns the `Node` and quit channels. Example given below:
```go
var opNodeChSet *NodeChSet
var nodeCh chan *Node
var quitCh chan int
// The explicit vars above just demo the returned types; Go's short variable
// declaration (:=) works just as well.
// Assumed call, with the return order inferred from the declarations above:
opNodeChSet, nodeCh, quitCh = MakeDefaultNodeChSet()
```
The user should supply the custom `OutputAdapter` to the crawler through the `OpAdapter` field of `CrawlOptions`, as sketched below.
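Here `MyAdapter` is a hypothetical type implementing `OutputAdapter`:

```go
opts := octopus.GetDefaultCrawlOptions()
opts.OpAdapter = &MyAdapter{} // MyAdapter must provide Consume() *NodeChSet
```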
#### Default Output Adapters:

We supply two default adapters for you to try out. They are not meant to be feature-rich, but you can still use them. Their primary purpose is to demonstrate how to build and use an `OutputAdapter`.
1. `adapter.StdOpAdapter`: Writes the crawled output (only links, not body) to the standard output.
1. `adapter.FileWriterAdapter`: Writes the crawled output (only links, not body) to a supplied file.
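The bundled adapters plug into `CrawlOptions` the same way as a custom one; for example, reusing the `opts` value from the earlier sketches:

```go
opts.OpAdapter = &adapter.StdOpAdapter{} // stream crawled links to stdout
```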
#### Implementation of the `adapter.StdOpAdapter`:

We have supplied the implementation of `adapter.StdOpAdapter` below to give a rough idea of what goes into building your own adapter.
```go
// StdOpAdapter is an output adapter that just prints the output onto the
// screen.
//
// Sample Output Format is:
// LinkNum - Depth - Url
type StdOpAdapter struct{}

func (s *StdOpAdapter) Consume() *oct.NodeChSet {
	listenCh := make(chan *oct.Node)
	quitCh := make(chan int, 1)
	listenChSet := &oct.NodeChSet{
		NodeCh: listenCh,
		StdChannels: &oct.StdChannels{
			QuitCh: quitCh,
		},
	}
	go func() {
		i := 1
		for {
			select {
			case output := <-listenCh:
				fmt.Printf("%d - %d - %s\n", i, output.Depth, output.UrlString)
				i++
			// The original listing is truncated after the print statement;
			// the quit handling and return below are an assumed reconstruction.
			case <-quitCh:
				return
			}
		}
	}()
	return listenChSet
}
```