.hack//tk

yet another technology blog

Creating a simple Craigslist Gatherer Bot in JavaScript

Craigslist is a classified advertisements website created in 1995. It contains sections devoted to items for sale, items wanted, jobs, housing, services, and more. Many cities and regions have their own Craigslist site, such as Vancouver’s Craigslist. Although the technology and look of the website are outdated, many people in North America still use it. It is one of the most popular and simple ways to buy and sell things on the internet. If you are a student like me, you can find some cool stuff for free in the “free” category.

A Craigslist gatherer bot is a program that parses and filters classified advertisements and sends you a daily digest email with the items that interest you. This way you can always keep an eye on good deals.

Since Craigslist does not provide an open API, we have to parse the HTML code and then filter the information. Note that if the HTML of the page changes even a bit, we might have to update our parser; otherwise it will stop working.

For this bot, we are going to use NodeJS and its package manager, NPM, to easily install the packages that this project depends on.

Installing NodeJS and NPM

First, you need to have NodeJS and its package manager, NPM, installed. They are both quite simple to install; you can find instructions at https://nodejs.org/en/download/ and https://www.npmjs.com/get-npm.

In this tutorial I am using Arch Linux, so some steps are shown on a Linux command-line shell.

More info: NodeJS, NPM, Arch Linux, Linux, Command-line Shell

Creating a NodeJS project

To create a new NodeJS project, create a new folder and run the command npm init inside it. The prompt will ask for some information such as project name, author, license, etc. Just fill them in as you please.

npm init

Now that the project is created, we are going to install a few dependencies using the command npm install. Note that we are using the --save-dev flag, which makes the packages appear in the development dependencies (devDependencies) of the project. Since the bot loads these packages at runtime, you may prefer to omit the flag so that they are listed under dependencies instead. If you are interested in understanding the npm dependency model, please check this blog post from Alex King.

npm install --save-dev configstore request htmlparser2 emailjs

More info: NPM Packages, Understanding the NPM dependency model

Main Loop

The main loop periodically calls the gather() function, which accesses Craigslist, gathers the information, filters it, and sends the daily digest email with the items that interest you. It is scheduled with setTimeout(), so the NodeJS event loop takes care of calling it again once the delay has elapsed.

Note that I am using a daily digest email, so we go through this loop once per day, which requires minimal processing power and network traffic. You can change the delay argument of setTimeout() to any period of time in milliseconds.

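const Configstore = require('configstore')
const request = require('request')
const htmlparser = require('htmlparser2')
const emailjs = require('emailjs')

// Items matching our keywords, and the subset that was not emailed yet
const FOUND_ITEMS = []
const NEW_FOUND_ITEMS = []

function mainloop () {
  // Empty both lists before each run; gather() fills them
  // asynchronously, once the HTTP response arrives
  FOUND_ITEMS.length = 0
  NEW_FOUND_ITEMS.length = 0

  gather()

  setTimeout(mainloop, 24 * 60 * 60 * 1000) // daily timeout
}

// Start the bot. Keep this call at the end of the file, after the
// constants used by gather() have been defined.
mainloop()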
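
As a small sketch of how the period could be made easier to change: the helpers periodToMs and runEvery below are hypothetical names that are not part of the bot, and the task is assumed to be a plain function such as gather().

```javascript
// Hypothetical helper: convert a human-friendly period into milliseconds,
// the unit that setTimeout() expects.
function periodToMs ({ days = 0, hours = 0, minutes = 0 }) {
  return ((days * 24 + hours) * 60 + minutes) * 60 * 1000
}

// Hypothetical helper: run task now, then again after every periodMs.
// Returns a function that cancels the next scheduled run.
function runEvery (task, periodMs) {
  let timer = null
  function tick () {
    task()
    timer = setTimeout(tick, periodMs)
  }
  tick()
  return function stop () {
    clearTimeout(timer)
  }
}

// Example periods: a daily digest and a twice-a-day digest
const DAILY = periodToMs({ days: 1 }) // 86400000
const TWICE_A_DAY = periodToMs({ hours: 12 }) // 43200000
```

With these helpers, runEvery(gather, DAILY) would play the role of mainloop(), and the returned stop function gives a way to end the loop without killing the process.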

More info: Event Loop

Gathering the information

First, we define LISTURL, which can be any Craigslist URL you choose. Then, LOCATIONFILTER, an optional sub-location of the city or region you chose. Finally, SALEFILTER, the category in which you want to search for items; in this case, we are using zip, which on Vancouver’s Craigslist stands for “free stuff”. Both filters are concatenated directly into the URL path, so a non-empty value needs a trailing slash, as in zip/.

Next, we use the request() function from the request package, which accesses the URL that contains the items that might interest us and passes the response and the body to our callback.

The response variable is an instance of http.IncomingMessage, which we use to check whether the request was successful, based on the HTTP response status code.

The body variable is the HTML code of the page we requested. We are going to use this data in our parser, then filter the results, and finally send our email.

const LISTURL = 'http://vancouver.craigslist.ca'
const LOCATIONFILTER = '' // e.g. 'van/'; one of bnc, rds, nvn, rch, pml, van
const SALEFILTER = 'zip/' // sss, ata, ppa, ara, sna, pta, wta, baa, bar, haa, bip, bia, bpa,
                          // boo, bka, bfa, cta, ema, moa, cla, cba, syp, sya, ela, gra, zip,
                          // fua, gms, foa, hva, hsa, jwa, maa, mpa, mca, msa, pha, rva, sga,
                          // tia, tla, taa, tra, vga, waa

function gather () {
  request(LISTURL + '/search/' + LOCATIONFILTER + SALEFILTER, function (error, response, body) {
    // On a network error there is no response to inspect, so stop here
    if (error) {
      console.log(error)
      return
    }

    // Anything other than 200 OK means we did not get the listing page
    if (response.statusCode !== 200) {
      console.log('statusCode:', response.statusCode)
      return
    }

    parser.reset() // lets the same parser be reused on the next run
    parser.write(body)
    parser.end()

    filterResults()

    sendEmail()
  })
}
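
Because the URL is built by plain string concatenation, a filter without its trailing slash (for example 'van' followed by 'zip/') would produce a broken path. The following is a minimal sketch of a more forgiving approach; buildSearchUrl and isUsableResponse are hypothetical helpers, not part of the request package, and the URL shape simply follows the concatenation used in gather().

```javascript
// Hypothetical helper: join the base URL and the optional filters into a
// search URL, adding or dropping slashes as needed.
function buildSearchUrl (listUrl, locationFilter, saleFilter) {
  const segments = ['search', locationFilter, saleFilter]
    .map(function (s) { return s.replace(/^\/+|\/+$/g, '') }) // trim slashes
    .filter(function (s) { return s.length > 0 }) // skip empty filters
  return listUrl.replace(/\/+$/, '') + '/' + segments.join('/') + '/'
}

// Hypothetical helper: decide whether the (error, response) pair handed to
// the request() callback describes a usable response.
function isUsableResponse (error, response) {
  return !error && Boolean(response) && response.statusCode === 200
}

console.log(buildSearchUrl('http://vancouver.craigslist.ca', '', 'zip/'))
// prints "http://vancouver.craigslist.ca/search/zip/"
console.log(buildSearchUrl('http://vancouver.craigslist.ca', 'van', 'zip'))
// prints "http://vancouver.craigslist.ca/search/van/zip/"
```

Inside the callback, a single check such as if (!isUsableResponse(error, response)) return would then cover both the network error and the unexpected status code.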

More info: Request Package, URL, http.IncomingMessage, HTTP response status codes, Parsing

Parsing the information

First, we define the items that interest us in the WANTED_ITEMS variable; in this case, I used the keywords “chair” and “desk”.

We then define an instance of htmlparser.Parser, which inspects the HTML code by going from tag to tag. This way we can spot where the information we want is, and get it.

The parsing must be customized for every different page we want to parse. In this case, I had to check the HTML code, see where the information I wanted was, and use the callbacks of htmlparser.Parser to get this data.

For Craigslist, we must first check whether we are inside a link (a) tag that has the result-title hdrlnk HTML class; if so, we set the inLink variable to true and save the link in the storedLink variable.

After that, the parser reaches the text inside the tag and calls ontext(). If inLink is true and there is text to be processed, we check whether this text contains at least one of the keywords we set in WANTED_ITEMS. If so, we save this item in the FOUND_ITEMS variable.

This way, we go through all the items that are listed on the page, saving only the ones that interest us in a variable that we are going to filter in the next section.

const WANTED_ITEMS = ['chair', 'desk']

const parser = new htmlparser.Parser({
  inLink: false, // true while we are inside a result title link
  storedLink: '', // href of the link we are currently inside

  onopentag: function (name, attribs) {
    if (name !== 'a') {
      this.inLink = false
      return
    }

    if (attribs.class === 'result-title hdrlnk') {
      this.inLink = true
      this.storedLink = attribs.href
    }
  },
  ontext: function (text) {
    if (!this.inLink || !text) {
      return
    }

    // Save the item once if its title contains any of the wanted keywords
    const title = text.toLowerCase()
    if (WANTED_ITEMS.some(function (keyword) { return title.includes(keyword) })) {
      FOUND_ITEMS.push({ text: text, storedLink: this.storedLink })
    }
  },
  onclosetag: function (tagname) {
    if (tagname === 'a') {
      this.inLink = false
    }
  }
}, { decodeEntities: true })
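
To see how the three callbacks cooperate, here is a minimal sketch that does not use htmlparser2 at all: createHandler is a hypothetical function, and the events that the parser would normally emit are fed to the handler by hand. The sample links and titles are made up for the illustration.

```javascript
// Build a handler with the same three callbacks the tutorial passes to
// htmlparser.Parser. Nothing here depends on htmlparser2 itself.
function createHandler (wantedItems, foundItems) {
  return {
    inLink: false,
    storedLink: '',
    onopentag: function (name, attribs) {
      this.inLink = name === 'a' && attribs.class === 'result-title hdrlnk'
      if (this.inLink) {
        this.storedLink = attribs.href
      }
    },
    ontext: function (text) {
      if (!this.inLink || !text) return
      const title = text.toLowerCase()
      if (wantedItems.some(function (k) { return title.includes(k) })) {
        foundItems.push({ text: text, storedLink: this.storedLink })
      }
    },
    onclosetag: function (name) {
      if (name === 'a') this.inLink = false
    }
  }
}

// Feed by hand the events a parser would emit for:
//   <a class="result-title hdrlnk" href="/van/zip/1.html">Free CHAIR</a>
//   <a class="other" href="/about">desk lamp ad</a>
const found = []
const handler = createHandler(['chair', 'desk'], found)
handler.onopentag('a', { class: 'result-title hdrlnk', href: '/van/zip/1.html' })
handler.ontext('Free CHAIR')
handler.onclosetag('a')
handler.onopentag('a', { class: 'other', href: '/about' })
handler.ontext('desk lamp ad')
handler.onclosetag('a')

console.log(found)
// found holds one item: { text: 'Free CHAIR', storedLink: '/van/zip/1.html' }
```

The second link mentions a keyword but lacks the expected class, so it is ignored. This is the behaviour that keeps navigation links and other page text out of the results.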

More info: htmlparser2 Package, HTML Tag, HTML Classes

Filtering the information

Now that the items that interest us are saved in the FOUND_ITEMS variable, we are going to check whether each item was already sent in a previous daily digest email. This way we do not send the same item twice.

To achieve that, we use Configstore, which loads and persists the item information for us. For each item found, we check whether its link is already saved in the store; if not, we put the item in the NEW_FOUND_ITEMS variable and save it into the store.

function filterResults () {
  const store = new Configstore('.craigslistGathererStore')

  FOUND_ITEMS.forEach(function (item) {
    // Use a fixed slice of the link (expected to hold the post id) as the store key
    const id = item.storedLink.substring(9, 19)
    if (!store.get(id)) {
      NEW_FOUND_ITEMS.push(item)
      store.set(id, 'true')
    }
  })
}
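
The fixed substring(9, 19) slice only works while every link has the same prefix length. As a hedged alternative, the sketch below takes the digits right before ".html" as the key; this link shape is an assumption based on the example links, extractPostId and selectNewItems are hypothetical helpers, and a small in-memory object stands in for Configstore so the example runs on its own.

```javascript
// Hypothetical helper: take the numeric id in front of ".html" as the key.
// Returns null when the link does not match that assumed shape.
function extractPostId (link) {
  const match = /(\d+)\.html$/.exec(link)
  return match ? match[1] : null
}

// Keep only items whose id is not in the store yet, and remember them.
// store only needs get(key) and set(key, value), like Configstore.
function selectNewItems (items, store) {
  const fresh = []
  items.forEach(function (item) {
    const id = extractPostId(item.storedLink)
    if (id !== null && !store.get(id)) {
      fresh.push(item)
      store.set(id, true)
    }
  })
  return fresh
}

// In-memory stand-in for Configstore, only for this example
const memory = new Map()
const store = {
  get: function (key) { return memory.get(key) },
  set: function (key, value) { memory.set(key, value) }
}

const items = [
  { text: 'Free chair', storedLink: '/van/zip/6123456789.html' },
  { text: 'Oak desk', storedLink: '/van/zip/6123456790.html' }
]

const firstRun = selectNewItems(items, store)
const secondRun = selectNewItems(items, store)
console.log(firstRun.length, secondRun.length) // prints "2 0"
```

On the second run nothing is returned, because both ids are already in the store. That is the property that keeps an item from being sent twice.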

More info: ConfigStore Package

Sending the information

Now that the new items that interest us are saved in NEW_FOUND_ITEMS, we are going to send the daily digest email.

First, we configure the emailjs server connection using the email user, password, host, SSL, and any other setting your email host might require. Then, we define the function that sends the email: it builds the body of the email with some string concatenation and then calls server.send(), which does the actual sending for us.

const TO_EMAIL = 'YOUR_TARGET_EMAIL'

// Connection settings for your SMTP host; adjust them to your provider
const server = emailjs.server.connect({
  user: 'YOUR_EMAIL_USERNAME',
  password: 'YOUR_PASSWORD',
  host: 'smtp.yandex.com',
  ssl: true
})

function sendEmail () {
  let text

  if (NEW_FOUND_ITEMS.length === 0) {
    text = 'Hey there, We didn\'t find any items for you'
  } else {
    text = 'Hey there, We\'ve found some cool items for you:\n\n'
    // One line per item: title, then the full link
    NEW_FOUND_ITEMS.forEach(function (e) {
      text += e.text + ' - ' + LISTURL + e.storedLink + '\n'
    })
  }
  text += '\nTake care!\n\n----------\nCraigslist Gatherer'

  server.send({
    text: text,
    from: 'YOUR_NAME <YOUR_EMAIL>',
    to: 'Someone <' + TO_EMAIL + '>',
    subject: 'Craigslist Gatherer - Daily Digest Email'
  }, function (err, message) {
    // Log the failure; on success there is nothing to report
    if (err) {
      console.log(err)
    }
  })
}
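
Building the body is easier to check when it is separated from the sending. The sketch below is one possible way to do that; buildDigestBody is a hypothetical helper that produces roughly the same text as sendEmail(), and the sample item is made up.

```javascript
// Hypothetical helper: build the plain-text body of the digest.
// items is a list of { text, storedLink }; listUrl is prefixed to each link.
function buildDigestBody (items, listUrl) {
  const lines = []
  if (items.length === 0) {
    lines.push('Hey there, We didn\'t find any items for you')
  } else {
    lines.push('Hey there, We\'ve found some cool items for you:', '')
    items.forEach(function (item) {
      lines.push(item.text + ' - ' + listUrl + item.storedLink)
    })
  }
  lines.push('', 'Take care!', '', '----------', 'Craigslist Gatherer')
  return lines.join('\n')
}

const body = buildDigestBody(
  [{ text: 'Free chair', storedLink: '/van/zip/6123456789.html' }],
  'http://vancouver.craigslist.ca'
)
console.log(body)
```

The returned string could then be passed as the text field of server.send(), which keeps the formatting testable without an SMTP connection.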

More info: EmailJS Package, JavaScript Text Formatting

Ready!

This concludes our simple Craigslist gatherer bot in JavaScript. If you are interested in this topic, you can search for popular gatherer bots and parsers on GitHub. There are plenty of open source gatherer bot and parser projects in different languages. You can find the complete code for this bot on my GitHub repository.