This is the **xpath** of the element on the page that contains the required information. A good way to get it (in a Chrome browser):
- Right click on the place where the information is present
- Click "Inspect" to open the Chrome developer tools window with the element highlighted
- On the highlighted value in the HTML source code, `Right click -> Copy -> Copy xpath`
<img width="352" alt="Screen Shot 2020-05-12 at 4 06 13 AM" src="https://user-images.githubusercontent.com/14211134/81619156-8d539080-9406-11ea-99bf-17e9e4da7e87.png"> | <img width="355" alt="Screen Shot 2020-05-12 at 4 08 02 AM" src="https://user-images.githubusercontent.com/14211134/81619157-8e84bd80-9406-11ea-8941-b6c6e0dfab46.png">

The xpath in the example above comes out to be: ```//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()```
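
To see how such an xpath resolves to a value, here is a minimal Python sketch. It is not how `webpluck` itself evaluates the expression; it only illustrates the path. The `PAGE` markup is a hypothetical, simplified stand-in for the real page, and because the standard library's `xml.etree.ElementTree` supports only a subset of XPath, the leading `//*` is written as `.//*` and the trailing `/text()` step is replaced by reading the element's `.text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the target page; the real markup differs.
PAGE = """<html><body>
<div id="content">
  <section><p>first section</p></section>
  <section><p>second section</p></section>
  <section>
    <ol>
      <li>placeholder heading
        <ol>
          <li>placeholder entry</li>
          <li>Joel Spolsky and Jeff Atwood launch Stack Overflow</li>
        </ol>
      </li>
    </ol>
  </section>
</div>
</body></html>"""

root = ET.fromstring(PAGE)

# Same steps as the copied xpath: the element with id="content", its third
# <section>, the first <li> of its <ol>, then the second <li> of the nested <ol>.
node = root.find(".//*[@id='content']/section[3]/ol/li[1]/ol/li[2]")

# Reading .text stands in for the xpath's final /text() step.
value = node.text.strip()
print(value)
```

A full XPath engine would accept the copied expression unchanged; the adjustments here exist only to keep the sketch dependency-free.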

### 3. `regex` to pluck the right value

Note that the xpath above leads us to the value: *Joel Spolsky and Jeff Atwood launch Stack Overflow*

Since we want to trim that down further, we'll provide a regex value to extract just the names.

This regex will fetch just the names (the value in parentheses):
``` ^(.*) launch .* ```
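
As a quick sanity check, the capture group can be tried out on the value from the xpath step. This is a small sketch using Python's `re` module, which is an assumption for illustration only; `webpluck` may use a different regex engine, though a pattern this simple should behave the same way in most of them.

```python
import re

# The text that the xpath step yields.
value = "Joel Spolsky and Jeff Atwood launch Stack Overflow"

# Group 1 captures everything before " launch ".
pattern = re.compile(r"^(.*) launch .*")

match = pattern.match(value)
names = match.group(1) if match else None
print(names)  # Joel Spolsky and Jeff Atwood
```

If the pattern does not match, `names` is `None`, which makes a stale xpath or a changed page easy to notice.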

## Sample hosted invocation

`webpluck` can be run as a standalone binary. To extract the names using the three params we just obtained, copy the `targets.yml` file and populate it with the parameters. The resulting `targets.yml` should look like this:
Now invoke webpluck as follows and obtain the answer:

```bash
$ ./webpluck_osx -f /path/to/targets.yml
{
  "stackoverflow_founders": "Joel Spolsky and Jeff Atwood"
}
```

## Sample API invocation

`webpluck` can be run in server mode as well. Clients written in other programming languages can then scrape web pages using the `webpluck` API over the network.

To run `webpluck` in server mode, listening on localhost on port 8080:

```bash
$ ./webpluck -p 8080
```

An instance of the `webpluck` API is running at `https://api.code.express/webpluck/`. You can use that for your light extraction needs. If your load is heavy, consider spinning up your own server running `webpluck`.

Armed with the knowledge of `baseUrl`, `xpath` and `regex`, we can now call the API endpoint by POSTing these three params: