Commit 195b572

Added Crawl Options and Output Adapter sections
1 parent da5d755 commit 195b572

1 file changed

README.md (119 additions, 3 deletions)

@@ -7,8 +7,8 @@ A concurrent web crawler to crawl the web.
  - Depth Limited Crawling
  - User-specified valid protocols
  - User-buildable adapters that the crawler feeds output to.
- - Filter Duplicates.
- - Filter URLs that fail a HEAD request.
+ - Filter Duplicates. (Default, Non-Customizable)
+ - Filter URLs that fail a HEAD request. (Default, Non-Customizable)
  - User-specifiable max timeout between two successive URL requests.
  - Max Number of Links to be crawled.

@@ -35,4 +35,120 @@ func main() {
    crawler.SetupSystem()
    crawler.BeginCrawling("https://www.example.com")
}
```

### List of customizations

Customizations can be made by supplying the crawler an instance of `CrawlOptions`. The basic structure is shown below, with a brief explanation of each option.

```go
type CrawlOptions struct {
    MaxCrawlDepth      int64         // Max depth of the crawl; 0 is the initial link.
    MaxCrawledUrls     int64         // Max number of links to be crawled in total.
    StayWithinBaseHost bool          // [Not-Implemented-Yet]
    CrawlRate          int64         // Max rate at which requests can be made (req/sec).
    CrawlBurstLimit    int64         // Max burst capacity (should be at least the crawl rate).
    RespectRobots      bool          // [Not-Implemented-Yet]
    IncludeBody        bool          // Include the response body (contents of the web page) in the crawl result.
    OpAdapter          OutputAdapter // A user-defined crawl output handler (see the next section).
    ValidProtocols     []string      // Valid protocols to crawl (http, https, ftp, etc.).
    TimeToQuit         int64         // Timeout (seconds) between two successive requests, after which the crawler quits.
}
```

A default instance of `CrawlOptions` can be obtained by calling `octopus.GetDefaultCrawlOptions()`, and can then be customized by overriding individual properties.
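
For instance, a minimal sketch of overriding the defaults is shown below (the `octopus` import alias is assumed from the constructor name; the field values are purely illustrative):

```go
// A sketch: start from the defaults and override individual fields.
// Field names come from the CrawlOptions struct above.
opts := octopus.GetDefaultCrawlOptions()
opts.MaxCrawlDepth = 2                          // stop two levels below the seed URL
opts.MaxCrawledUrls = 100                       // crawl at most 100 links in total
opts.ValidProtocols = []string{"http", "https"} // skip other protocols
opts.IncludeBody = true                         // ship page contents along with each result
```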

### Output Adapters

An Output Adapter is the final destination of each crawler-processed request: the crawler feeds its output here, shaped by the customizations made in the `CrawlOptions` attached to the crawler before it starts.

`OutputAdapter` is a Go interface that has to be implemented by your (user-defined) processor.

```go
type OutputAdapter interface {
    Consume() *NodeChSet
}
```

The user has to implement the `Consume()` method, which returns a __*pointer*__ to a `NodeChSet` (described below). The crawler uses the returned channel to send the crawl output, which the user can then listen for.

**Note**: if you implement a custom `OutputAdapter`, **remember** to listen for the output on another goroutine; otherwise you might block the crawler from running. At the very least, begin the crawl on another goroutine before you start processing output (see the sketch after the `MakeDefaultNodeChSet()` example below).

The structure of the `NodeChSet` is given below.

```go
type NodeChSet struct {
    NodeCh chan<- *Node
    *StdChannels
}

type StdChannels struct {
    QuitCh chan<- int
}

type Node struct {
    *NodeInfo
    Body io.ReadCloser
}

type NodeInfo struct {
    ParentUrlString string
    UrlString       string
    Depth           int64
}
```

You can use the utility function `MakeDefaultNodeChSet()` to get a `NodeChSet` built for you. It also returns the node and quit channels, as shown below:

```go
var opNodeChSet *NodeChSet
var nodeCh chan *Node
var quitCh chan int
// The explicit declarations above just demonstrate the types;
// plain `:=` assignment works equally well.
opNodeChSet, nodeCh, quitCh = MakeDefaultNodeChSet()
```
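
A `Consume()` built on this helper would hand `opNodeChSet` to the crawler and read from `nodeCh` elsewhere. As the note above warns, keep that read loop off the crawler's goroutine; a hedged sketch of the pattern is below (the `crawler.BeginCrawling` call is from the Getting Started snippet, and the rest is an assumed usage, not API from this diff):

```go
// A sketch of the pattern from the note above: run the crawl on its own
// goroutine, then drain nodeCh (from MakeDefaultNodeChSet) on this one.
go crawler.BeginCrawling("https://www.example.com")

for {
    select {
    case node := <-nodeCh:
        // Depth and UrlString are promoted from the embedded *NodeInfo.
        fmt.Printf("depth %d: %s\n", node.Depth, node.UrlString)
    case <-quitCh:
        return // the crawler signalled shutdown
    }
}
```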

The user should supply the custom `OutputAdapter` through the `OpAdapter` field of `CrawlOptions`.

#### Default Output Adapters:

We supply two default adapters for you to try out. They are not meant to be feature-rich, but you can still use them; their primary purpose is to demonstrate how to build and use an `OutputAdapter`.

1. `adapter.StdOpAdapter`: writes the crawled output (only links, not the body) to the standard output.
1. `adapter.FileWriterAdapter`: writes the crawled output (only links, not the body) to a supplied file.
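
For instance, a sketch of wiring the stdout adapter into the options (the `adapter` and `octopus` import aliases are assumptions):

```go
// Plug a provided adapter into the options via the OpAdapter field.
opts := octopus.GetDefaultCrawlOptions()
opts.OpAdapter = &adapter.StdOpAdapter{} // Consume() has a pointer receiver
```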

#### Implementation of the `adapter.StdOpAdapter`:

We have included the implementation of `adapter.StdOpAdapter` below to give a rough idea of what goes into building your own adapter.

```go
// StdOpAdapter is an output adapter that just prints the output onto the
// screen.
//
// Sample output format:
// LinkNum - Depth - Url
type StdOpAdapter struct{}

func (s *StdOpAdapter) Consume() *oct.NodeChSet {
    listenCh := make(chan *oct.Node)
    quitCh := make(chan int, 1)
    listenChSet := &oct.NodeChSet{
        NodeCh: listenCh,
        StdChannels: &oct.StdChannels{
            QuitCh: quitCh,
        },
    }
    go func() {
        i := 1
        for {
            select {
            case output := <-listenCh:
                fmt.Printf("%d - %d - %s\n", i, output.Depth, output.UrlString)
                i++
            case <-quitCh:
                return
            }
        }
    }()
    return listenChSet
}
```
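
Two details of this implementation are worth copying into your own adapters: the processing loop runs on its own goroutine, so `Consume()` returns immediately instead of blocking the crawler, and the quit channel is buffered (capacity 1), so a shutdown signal can be delivered even before the loop reads it.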
