
Commit d1b9cbe

Merge pull request #5 from rapidclock/#a98g7-feature-add-rate-limiting
#a98g7 feature add rate limiting
2 parents da5d755 + f4f64ca commit d1b9cbe

18 files changed: +304 −83 lines

README.md

Lines changed: 119 additions & 3 deletions
@@ -7,8 +7,8 @@ A concurrent web crawler to crawl the web.
 - Depth Limited Crawling
 - User specified valid protocols
 - User buildable adapters that the crawler feeds output to.
-- Filter Duplicates.
-- Filter URLs that fail a HEAD request.
+- Filter Duplicates. (Default, Non-Customizable)
+- Filter URLs that fail a HEAD request. (Default, Non-Customizable)
 - User specifiable max timeout between two successive url requests.
 - Max Number of Links to be crawled.
 
@@ -35,4 +35,120 @@ func main() {
     crawler.SetupSystem()
     crawler.BeginCrawling("https://www.example.com")
 }
-```
+```
+
+### List of customizations
+
+Customizations can be made by supplying the crawler an instance of `CrawlOptions`. The basic structure is shown below, with a brief explanation of each option.
+
+```go
+type CrawlOptions struct {
+    MaxCrawlDepth         int64         // Max depth of the crawl; 0 is the initial link.
+    MaxCrawledUrls        int64         // Max number of links to be crawled in total.
+    StayWithinBaseHost    bool          // [Not-Implemented-Yet]
+    CrawlRatePerSec       int64         // Max rate at which requests can be made (req/sec).
+    CrawlBurstLimitPerSec int64         // Max burst capacity (should be at least the crawl rate).
+    RespectRobots         bool          // [Not-Implemented-Yet]
+    IncludeBody           bool          // Include the response body (contents of the web page) in the crawl result.
+    OpAdapter             OutputAdapter // A user-defined crawl output handler (see the next section).
+    ValidProtocols        []string      // Valid protocols to crawl (http, https, ftp, etc.)
+    TimeToQuit            int64         // Timeout (seconds) between two successive crawl outputs before the crawler quits.
+}
+```
+
+A default instance of `CrawlOptions` can be obtained by calling `octopus.GetDefaultCrawlOptions()`. This can be further customized by overriding individual properties, as in the sketch below.
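
For illustration only (an editorial sketch, not part of this diff): overriding individual defaults, including the rate-limiting fields this PR introduces. The import path is an assumption; adjust it to the actual module.

```go
package main

import "github.com/rapidclock/octopus/octopus" // assumed import path

func main() {
	opts := octopus.GetDefaultCrawlOptions()
	opts.MaxCrawlDepth = 3         // default is 2; 0 is the initial link
	opts.MaxCrawledUrls = 100      // default is -1 (no limit)
	opts.CrawlRatePerSec = 5       // default is -1 (rate limiting disabled)
	opts.CrawlBurstLimitPerSec = 5 // must be >= CrawlRatePerSec
	opts.TimeToQuit = 60           // default is 30 seconds
	_ = opts                       // attach to a crawler as in the usage example above
}
```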
+
+### Output Adapters
+
+An Output Adapter is the final destination of a crawler-processed request. The crawler's output is fed here, according to the customizations made through the `CrawlOptions` attached to the crawler before it starts.
+
+`OutputAdapter` is a Go interface that has to be implemented by your (user-defined) processor.
+
+```go
+type OutputAdapter interface {
+    Consume() *NodeChSet
+}
+```
+
+The user has to implement the `Consume()` method, which returns a __*pointer*__ to a `NodeChSet` (described below). The crawler sends the crawl output on the returned channel, and the user can listen on it for output.
+
+**Note**: If you implement a custom `OutputAdapter`, **REMEMBER** to listen for the output on another goroutine; otherwise you might block the crawler. At the very least, begin the crawling on another goroutine before you begin processing output. An end-to-end sketch follows the `StdOpAdapter` listing below.
+
+The structure of the `NodeChSet` is given below.
+
+```go
+type NodeChSet struct {
+    NodeCh chan<- *Node
+    *StdChannels
+}
+
+type StdChannels struct {
+    QuitCh chan<- int
+}
+
+type Node struct {
+    *NodeInfo
+    Body io.ReadCloser
+}
+
+type NodeInfo struct {
+    ParentUrlString string
+    UrlString       string
+    Depth           int64
+}
+```
+
+You can use the utility function `MakeDefaultNodeChSet()` to get a `NodeChSet` built for you. It also returns the `Node` and quit channels. Example given below:
+
+```go
+var opNodeChSet *NodeChSet
+var nodeCh chan *Node
+var quitCh chan int
+// The declarations above only demonstrate the types; in practice use Go's
+// short variable declaration (:=) and let the types be inferred.
+opNodeChSet, nodeCh, quitCh = MakeDefaultNodeChSet()
+```
+
+The user should supply the custom `OutputAdapter` to the crawler through the `OpAdapter` field of `CrawlOptions`.
+
+#### Default Output Adapters:
+
+We supply two default adapters for you to try out. They are not meant to be feature-rich, but you can still use them; their primary purpose is to demonstrate how to build and use an `OutputAdapter`.
+
+1. `adapter.StdOpAdapter` : Writes the crawled output (only links, not body) to the standard output.
+1. `adapter.FileWriterAdapter` : Writes the crawled output (only links, not body) to a supplied file.
+
+#### Implementation of the `adapter.StdOpAdapter`:
+
+We have supplied the implementation of `adapter.StdOpAdapter` below to give a rough idea of what goes into building your own adapter.
+
+```go
+// StdOpAdapter is an output adapter that just prints the output onto the
+// screen.
+//
+// Sample Output Format is:
+// LinkNum - Depth - Url
+type StdOpAdapter struct{}
+
+func (s *StdOpAdapter) Consume() *oct.NodeChSet {
+    listenCh := make(chan *oct.Node)
+    quitCh := make(chan int, 1)
+    listenChSet := &oct.NodeChSet{
+        NodeCh: listenCh,
+        StdChannels: &oct.StdChannels{
+            QuitCh: quitCh,
+        },
+    }
+    go func() {
+        i := 1
+        for {
+            select {
+            case output := <-listenCh:
+                fmt.Printf("%d - %d - %s\n", i, output.Depth, output.UrlString)
+                i++
+            case <-quitCh:
+                return
+            }
+        }
+    }()
+    return listenChSet
+}
+```
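
The end-to-end sketch promised above (editorial, not part of this diff): a custom `OutputAdapter` built on `MakeDefaultNodeChSet()`, following the pattern the Note prescribes — the listener runs on its own goroutine so `Consume()` returns immediately. The import path and the `countingAdapter` type are hypothetical; `oct` mirrors the alias the repo's adapter package uses.

```go
package main

import (
	"fmt"

	oct "github.com/rapidclock/octopus/octopus" // assumed import path
)

// countingAdapter is a hypothetical OutputAdapter that numbers each link.
type countingAdapter struct {
	total int64
}

func (c *countingAdapter) Consume() *oct.NodeChSet {
	// MakeDefaultNodeChSet returns the ChSet plus direct access to both channels.
	chSet, nodeCh, quitCh := oct.MakeDefaultNodeChSet()
	go func() { // listen on a separate goroutine so the crawler never blocks
		for {
			select {
			case node := <-nodeCh:
				c.total++
				fmt.Printf("%d - %d - %s\n", c.total, node.Depth, node.UrlString)
			case <-quitCh:
				return
			}
		}
	}()
	return chSet
}
```

The adapter would then be attached through the `OpAdapter` field of `CrawlOptions` before `SetupSystem()` is called.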

adapter/basicadapters.go

Lines changed: 5 additions & 1 deletion
@@ -73,7 +73,11 @@ func (fw *FileWriterAdapter) writeToFile(listenCh chan *oct.Node,
     for {
         select {
         case output := <-listenCh:
-            fmt.Fprintf(fp, "%d - %s\n", output.Depth, output.UrlString)
+            _, err = fmt.Fprintf(fp, "%d - %s\n", output.Depth,
+                output.UrlString)
+            if err != nil {
+                log.Println("File Error - ", err)
+            }
         case <-quitCh:
             return
         }

octopus/core.go

Lines changed: 26 additions & 36 deletions
@@ -2,37 +2,9 @@ package octopus
 
 import (
     "fmt"
-    "log"
     "time"
 )
 
-func (o *octopus) setupOctopus() {
-    o.setupValidProtocolMap()
-    o.setupTimeToQuit()
-    o.setupMaxLinksCrawled()
-}
-
-func (o *octopus) setupValidProtocolMap() {
-    o.isValidProtocol = make(map[string]bool)
-    for _, protocol := range o.ValidProtocols {
-        o.isValidProtocol[protocol] = true
-    }
-}
-
-func (o *octopus) setupTimeToQuit() {
-    if o.TimeToQuit > 0 {
-        o.timeToQuit = time.Duration(o.TimeToQuit) * time.Second
-    } else {
-        log.Fatalln("TimeToQuit is not greater than 0")
-    }
-}
-
-func (o *octopus) setupMaxLinksCrawled() {
-    if o.MaxCrawledUrls == 0 {
-        panic("MaxCrawledUrls should either be negative or greater than 0.")
-    }
-}
-
 func (o *octopus) SetupSystem() {
     o.isReady = false
     o.setupOctopus()

@@ -57,16 +29,12 @@ func (o *octopus) SetupSystem() {
     depthLimitChSet := o.makeCrawlDepthFilterPipe(pageParseChSet)
     maxDelayChSet := o.makeMaxDelayPipe(depthLimitChSet)
 
-    var distributorChSet *NodeChSet
-    if o.MaxCrawledUrls < 0 {
-        distributorChSet = o.makeDistributorPipe(maxDelayChSet, outAdapterChSet)
-    } else {
-        maxLinksCrawledChSet := o.makeLimitCrawlPipe(outAdapterChSet)
-        distributorChSet = o.makeDistributorPipe(maxDelayChSet, maxLinksCrawledChSet)
-    }
+    distributorChSet := o.handleDistributorPipeline(maxDelayChSet, outAdapterChSet)
 
     pageReqChSet := o.makePageRequisitionPipe(distributorChSet)
-    invUrlFilterChSet := o.makeInvalidUrlFilterPipe(pageReqChSet)
+
+    invUrlFilterChSet := o.handleRateLimitingPipeline(pageReqChSet)
+
     dupFilterChSet := o.makeDuplicateUrlFilterPipe(invUrlFilterChSet)
     protoFilterChSet := o.makeUrlProtocolFilterPipe(dupFilterChSet)
     linkAbsChSet := o.makeLinkAbsolutionPipe(protoFilterChSet)

@@ -77,6 +45,28 @@ func (o *octopus) SetupSystem() {
     o.isReady = true
 }
 
+func (o *octopus) handleDistributorPipeline(maxDelayChSet, outAdapterChSet *NodeChSet) *NodeChSet {
+    var distributorChSet *NodeChSet
+    if o.MaxCrawledUrls < 0 {
+        distributorChSet = o.makeDistributorPipe(maxDelayChSet, outAdapterChSet)
+    } else {
+        maxLinksCrawledChSet := o.makeCrawlLinkCountLimitPipe(outAdapterChSet)
+        distributorChSet = o.makeDistributorPipe(maxDelayChSet, maxLinksCrawledChSet)
+    }
+    return distributorChSet
+}
+
+func (o *octopus) handleRateLimitingPipeline(pageReqChSet *NodeChSet) *NodeChSet {
+    var invUrlFilterChSet *NodeChSet
+    if o.rateLimiter != nil {
+        rateLimitingChSet := o.makeRateLimitingPipe(pageReqChSet)
+        invUrlFilterChSet = o.makeInvalidUrlFilterPipe(rateLimitingChSet)
+    } else {
+        invUrlFilterChSet = o.makeInvalidUrlFilterPipe(pageReqChSet)
+    }
+    return invUrlFilterChSet
+}
+
 func (o *octopus) BeginCrawling(baseUrlStr string) {
     if !o.isReady {
         panic("Call BuildSystem first to setup Octopus")

octopus/doc.go

Lines changed: 4 additions & 3 deletions
@@ -24,14 +24,15 @@ The overview of the Pipeline is given below:
     3. Protocol Filter
     4. Duplicate Filter
     5. Invalid Url Filter (Urls whose HEAD request Fails)
-    6. Make GET Request
+    (5x) (Optional) Crawl Rate Limiter.
+    [6]. Make GET Request
     7a. Send to Output Adapter
     7b. Check for Timeout (gap between two outputs on this channel).
     8. Max Links Crawled Limit Filter
     9. Depth Limit Filter
     10. Parse Page for more URLs.
 
 Note: The output from 7b. is fed to 8.
-    1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7b -> 8 -> 9 -> 10 -> 1
-*/
+    1 -> 2 -> 3 -> 4 -> 5 -> (5x) -> [6] -> 7b -> 8 -> 9 -> 10 -> 1
+*/
 package octopus
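
Each numbered stage above is a channel-connected pipe. A hedged editorial sketch of the general shape of one linear stage (illustrative; the repo's actual helper is `stdLinearNodeFunc`, whose body is not shown in this diff):

```go
// linearStage returns the NodeChSet that the upstream stage writes into.
// Each incoming node is handed to fn, which decides whether (and in what
// form) to forward it to outChSet.
func linearStage(fn func(*Node, *NodeChSet), outChSet *NodeChSet) *NodeChSet {
	inCh := make(chan *Node)
	quitCh := make(chan int, 1)
	go func() {
		for {
			select {
			case node := <-inCh:
				fn(node, outChSet)
			case <-quitCh:
				return
			}
		}
	}()
	return MakeNodeChSet(inCh, quitCh)
}
```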

octopus/modelfactory.go

Lines changed: 24 additions & 14 deletions
@@ -3,11 +3,13 @@ package octopus
 import "sync"
 
 const (
-    defaultMaxDepth   int64 = 2
-    anchorTag               = "a"
-    anchorAttrb             = "href"
-    defaultTimeToQuit       = 5
-    defaultCrawlLimit int64 = -1
+    defaultMaxDepth       int64  = 2
+    anchorTag                    = "a"
+    anchorAttrb                  = "href"
+    defaultTimeToQuit            = 30
+    defaultLinkCrawlLimit int64  = -1
+    defaultCrawlRateLimit int64  = -1
+    defaultRequestTimeout uint64 = 15
 )
 
 // NewWithDefaultOptions - Create an Instance of the Octopus with the default CrawlOptions.

@@ -44,15 +46,16 @@ func createNode(parentUrlStr, urlStr string, depth int64) *Node {
 // Returns an instance of CrawlOptions with the values set to sensible defaults.
 func GetDefaultCrawlOptions() *CrawlOptions {
     return &CrawlOptions{
-        MaxCrawlDepth:      defaultMaxDepth,
-        MaxCrawledUrls:     defaultCrawlLimit,
-        StayWithinBaseHost: false,
-        CrawlRate:          -1,
-        RespectRobots:      false,
-        IncludeBody:        true,
-        OpAdapter:          nil,
-        ValidProtocols:     []string{"http", "https"},
-        TimeToQuit:         defaultTimeToQuit,
+        MaxCrawlDepth:         defaultMaxDepth,
+        MaxCrawledUrls:        defaultLinkCrawlLimit,
+        StayWithinBaseHost:    false,
+        CrawlRatePerSec:       defaultCrawlRateLimit,
+        CrawlBurstLimitPerSec: defaultCrawlRateLimit,
+        RespectRobots:         false,
+        IncludeBody:           true,
+        OpAdapter:             nil,
+        ValidProtocols:        []string{"http", "https"},
+        TimeToQuit:            defaultTimeToQuit,
     }
 }

@@ -65,3 +68,10 @@ func MakeNodeChSet(nodeCh chan<- *Node, quitCh chan<- int) *NodeChSet {
         },
     }
 }
+
+// Utility to create a NodeChSet and get full access to the Quit & Node Channel.
+func MakeDefaultNodeChSet() (*NodeChSet, chan *Node, chan int) {
+    nodeCh := make(chan *Node)
+    quitCh := make(chan int)
+    return MakeNodeChSet(nodeCh, quitCh), nodeCh, quitCh
+}

octopus/models.go

Lines changed: 19 additions & 11 deletions
@@ -4,6 +4,8 @@ import (
     "io"
     "sync"
     "time"
+
+    "golang.org/x/time/rate"
 )
 
 // octopus is a concurrent web crawler.

@@ -20,6 +22,8 @@ type octopus struct {
     inputUrlStrChan   chan string
     masterQuitCh      chan int
     crawledUrlCounter int64
+    rateLimiter       *rate.Limiter
+    requestTimeout    uint64
 }
 
 // CrawlOptions is used to house options for crawling.

@@ -37,8 +41,11 @@ type octopus struct {
 // StayWithinBaseHost - (unimplemented) Ensures crawler stays within the
 // level 1 link's hostname.
 //
-// CrawlRate (unimplemented) is the rate at which requests will be made.
-// In seconds
+// CrawlRatePerSec - is the rate at which requests will be made (per second).
+// If this is negative, rate limiting is disabled. Default is negative.
+//
+// CrawlBurstLimitPerSec - Represents the max burst capacity with which requests
+// can be made. This must be greater than or equal to the CrawlRatePerSec.
 //
 // RespectRobots (unimplemented) choose whether to respect robots.txt or not.
 //

@@ -54,15 +61,16 @@ type octopus struct {
 // TimeToQuit - represents the total time to wait between two new nodes to be
 // generated before the crawler quits. This is in seconds.
 type CrawlOptions struct {
-    MaxCrawlDepth      int64
-    MaxCrawledUrls     int64
-    StayWithinBaseHost bool
-    CrawlRate          int64
-    RespectRobots      bool
-    IncludeBody        bool
-    OpAdapter          OutputAdapter
-    ValidProtocols     []string
-    TimeToQuit         int64
+    MaxCrawlDepth         int64
+    MaxCrawledUrls        int64
+    StayWithinBaseHost    bool
+    CrawlRatePerSec       int64
+    CrawlBurstLimitPerSec int64
+    RespectRobots         bool
+    IncludeBody           bool
+    OpAdapter             OutputAdapter
+    ValidProtocols        []string
+    TimeToQuit            int64
 }
 
 // NodeInfo is used to represent each crawled link and its associated crawl depth.
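
The setup code that populates the new `rateLimiter` field lives in a file not shown in this excerpt. A hedged editorial sketch of how the renamed options could map onto `golang.org/x/time/rate` (illustrative, not the repo's actual setup):

```go
// Illustrative only: a negative CrawlRatePerSec leaves o.rateLimiter nil,
// which handleRateLimitingPipeline in core.go treats as "no rate limiting".
if o.CrawlRatePerSec > 0 {
	o.rateLimiter = rate.NewLimiter(
		rate.Limit(o.CrawlRatePerSec), // sustained requests per second
		int(o.CrawlBurstLimitPerSec),  // burst capacity (>= the rate)
	)
}
```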

octopus/pipe_augment_linkabsolution.go

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ import (
 )
 
 func (o *octopus) makeLinkAbsolutionPipe(outChSet *NodeChSet) *NodeChSet {
-    return stdLinearNodeFunc(makeLinkAbsolute, outChSet)
+    return stdLinearNodeFunc(makeLinkAbsolute, outChSet, "Link Absolution")
 }
 
 func makeLinkAbsolute(node *Node, outChSet *NodeChSet) {

octopus/pipe_ctrl_limitcrawl.go

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ import (
     "sync/atomic"
 )
 
-func (o *octopus) makeLimitCrawlPipe(inChSet *NodeChSet) *NodeChSet {
-    return stdLinearNodeFunc(o.checkWithinLimit, inChSet)
+func (o *octopus) makeCrawlLinkCountLimitPipe(inChSet *NodeChSet) *NodeChSet {
+    return stdLinearNodeFunc(o.checkWithinLimit, inChSet, "Crawl Link Limit")
 }
 
 func (o *octopus) checkWithinLimit(node *Node, outChSet *NodeChSet) {
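
`checkWithinLimit` itself is unchanged by this diff, so its body is not shown. A hedged editorial sketch of what an atomic link-count guard can look like, using the `crawledUrlCounter` field from models.go (illustrative, not the repo's exact logic):

```go
// Illustrative only: atomically count nodes that pass and drop any beyond
// the configured MaxCrawledUrls.
func (o *octopus) withinLimit(node *Node, outChSet *NodeChSet) {
	if atomic.AddInt64(&o.crawledUrlCounter, 1) <= o.MaxCrawledUrls {
		outChSet.NodeCh <- node
	}
}
```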
