Crawling linkedin
September 13, 2017
easy js script automation social

Background
While linkedin might not be as popular as other social networks, in many countries it has carved out a comfortable niche for itself as the go-to network for keeping track of job related contacts. It’s very annoying to use and doesn’t even try to hide its manipulative nature, but it can be useful from time to time. The most common example would be getting the word out when looking to switch jobs. However, manually going through profiles that might be interesting is a waste of time, and we could hardly call ourselves developers if we didn’t try to somehow automate this drudgery.
One of the most common things we’d like to automate is adding new people. While this might not be the ideal way to enlarge our network, recruiters are generally very happy to connect with potential hires.
There are a few options:
- use the official linkedin API
- write our own crawler
While going with option #1 would be very convenient in theory, we quickly discover that the linkedin API is extremely limited and what’s even worse – poorly documented. The only documentation available is an automatically generated list of endpoints and their outputs, without so much as a sentence written by a human. We are left guessing at the author’s intent.
Therefore, we have decided to write a crawler that takes the same information that we would get while browsing, and extracts the bits we want. For this task, the most suitable language seems to be JavaScript, since it allows us to make use of the browser’s own JavaScript virtual machine, lets us easily access and manipulate the DOM, and so on.
It should:
- go through profiles automatically
- parse profile information and add interesting people
To explore a number of profiles, we’ll need to find out about them, probably from other profiles. Luckily, linkedin profile pages have a sidebar with 10 profiles that “people also viewed” – usually these are similar to the profile being viewed (i.e. when viewing a recruiter’s profile, we will be shown other recruiters, often from the same area or company). This functionality can be turned off, but most people stick with the default setting, which is enough for us.
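A rough sketch of pulling these related profiles out of the sidebar could look something like this – note that the class name is only a guess, since linkedin’s markup changes frequently:

function extractSidebarProfiles() {
    // grab the /in/<id> links from the "people also viewed" sidebar;
    // ".pv-browsemap-section" is an assumption and may need adjusting
    var links = document.querySelectorAll(".pv-browsemap-section a[href*='/in/']");
    return Array.prototype.map.call(links, function (a) {
        var href = a.getAttribute("href");                // e.g. "/in/some-profile-id/"
        return href.split("/in/")[1].replace(/\/$/, "");  // keep just the profile ID
    });
}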
We will need to store these profiles somewhere – a queue is a good fit for this use-case. Since we just want to add people that might turn out to be useful later and we are using linkedin’s own recommendation algorithms (remember, we are using the related profiles) we don’t need any sophisticated search method – a regular breadth first search is sufficient. We simply keep adding new profiles to the end of the queue, and take profiles to connect with from the front. At the same time, we parse the profile page sidebar and queue up the interesting people we find.
Since there’s nothing revolutionary about using a queue, we’ll just use an implementation from quora. JavaScript doesn’t have classes in the classical sense (the class keyword is just sugar over prototypes), so it’s simply a constructor function with a few methods defined on its prototype.
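A minimal sketch of what such a queue might look like (the field names match the ones we’ll need later when restoring it from storage):

function Queue() {
    this.queue = [];    // backing array
    this.front = 0;     // index of the next item to dequeue
    this.rear = 0;      // index where the next item will be enqueued
    this.size = 0;      // number of items currently in the queue
}

Queue.prototype.enqueue = function (item) {
    this.queue[this.rear++] = item;
    this.size++;
};

Queue.prototype.dequeue = function () {
    if (this.size === 0) {
        return undefined;
    }
    var item = this.queue[this.front++];
    this.size--;
    return item;
};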
To avoid exploring the same profile multiple times, we’ll use a hash table with the profile ID. In JavaScript this is as simple as creating an object and using the IDs as keys.
visited[profileID] = true;
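Checking and marking a profile before queueing it is then just a lookup – a small sketch:

// only queue profiles we haven't come across yet
if (!visited[profileID]) {
    visited[profileID] = true;
    q.enqueue(profileID);
}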
To decide whether a person is interesting enough to queue up, we can use a list of words and look for them in the person’s title.
targetWords = ["technical", "IT", "engineering", "software", "recruiter", "recruitment", "sourcer", "sourcing", "headhunter", "talent", "acquisition", "google", "amazon", "facebook", "uber", "airbnb"];
There are many ways to improve this, e.g. use a few lists of words to look for combinations:
if(matchTitle(title, words1) && matchTitle(title, words2)) {...}
or parse the detailed profile description, but for now it’s more than enough. We can set a lower threshold for adding someone to the queue and a higher one for trying to connect with them – there might be people we don’t want to connect with who might nevertheless lead us to other useful connections.
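matchTitle isn’t shown above; a minimal version of it – and of the two-threshold idea – could look something like this:

// count how many of the target words appear in the title (case-insensitive)
function titleScore(title, words) {
    var lower = title.toLowerCase();
    return words.filter(function (word) {
        return lower.indexOf(word.toLowerCase()) !== -1;
    }).length;
}

function matchTitle(title, words) {
    return titleScore(title, words) > 0;
}

// e.g. queue anyone matching at least one word, but only try to connect
// with people matching two or more (the exact thresholds are arbitrary)
var shouldQueue = titleScore(title, targetWords) >= 1;
var shouldConnect = titleScore(title, targetWords) >= 2;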
We can use the DOM to access any elements that we need, for example to get the profile’s degree of relation (so we can skip people we are already connected with):
function extractDegreeFromProfilePage() {
    // e.g. "2nd degree connection" -> 2; returns null if the information isn't on the page
    var card = document.getElementsByClassName("pv-top-card-section__body")[0];
    var label = card && card.getElementsByClassName("visually-hidden")[0];
    if (!label || !label.innerText) {
        return null;
    }
    return parseInt(label.innerText.slice(0, 1), 10);
}
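A typical use would then be something along these lines:

var degree = extractDegreeFromProfilePage();
if (degree === 1) {
    // already a 1st degree connection – just harvest the sidebar and move on
}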
After extracting all the information we want, we move on to the next profile in the queue. We can use location.assign() to do this, like so:
setTimeout(function () {
    var next = q.dequeue();
    if (next === undefined) {
        console.log("queue empty, returning");
        return;
    }
    // persist the remaining queue before navigating away from this page
    localStorage.setItem("liQ", JSON.stringify(q));
    // liAddress is the base profile URL, defined elsewhere in the script
    location.assign(liAddress + next);
}, 30000 + randomDelay);
We are also waiting for at least 30 seconds to avoid rate limiting and other anti-bot measures. To get a random integer, we can use a snippet from MDN:
function getRandomInt(min, max) {
    min = Math.ceil(min);
    max = Math.floor(max);
    // the maximum is exclusive and the minimum is inclusive
    return Math.floor(Math.random() * (max - min)) + min;
}
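The randomDelay used above can then be derived from it – the exact range is a guess, anything reasonable will do:

// add a random 0–30 seconds on top of the fixed 30 second wait
var randomDelay = getRandomInt(0, 31) * 1000;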
You might have noticed we are using localStorage to save our queue. We need to do this because when we move to a new page with location.assign(), we lose all the variables defined on the previous page. Conveniently, this also makes it simple to continue where we left off during previous sessions.
One caveat of this approach is that when retrieving the queue from localStorage, we get the object and its fields, but the connection to the Queue prototype is lost and we can’t use its methods. This can be solved by passing a function (a reviver) as the second argument to JSON.parse. It is called for each field of the object being parsed and can help us reconstruct the object – similarly to how a constructor would be used when creating one. However, we don’t need a general reviver; a simple function that creates a Queue and assigns it all the fields retrieved from localStorage is good enough for us.
function reviver(o) {
    var q = new Queue();
    q.front = o.front;
    q.rear = o.rear;
    q.size = o.size;
    q.queue = o.queue;
    return q;
}
q = JSON.parse(localStorage.getItem("liQ"));
q = reviver(q);
While we can run parts of the code in the developer console of our browser to test them out, the final crawler should run automatically. We can use a userscript extension to accomplish this – Tampermonkey for Chrome and Greasemonkey for Firefox. We will soon notice that our script doesn’t share variables with the website, and the following code doesn’t really work the way we’d like it to:
testVar = "test variable";    // set from inside the userscript
console.log(testVar);         // undefined as far as the page itself is concerned
We can fix this by granting the script more privileges in the header with // @grant unsafeWindow and by using the unsafeWindow object like so:
unsafeWindow.testVar = "test variable";
console.log(testVar); // "test variable"
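For reference, the metadata block of such a userscript might look roughly like this (the name and @match pattern are just placeholders):

// ==UserScript==
// @name         linkedin-crawler
// @match        https://www.linkedin.com/in/*
// @grant        unsafeWindow
// ==/UserScript==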
We’ll do this for our queue as well:
unsafeWindow.Queue = function () {...};
Methods can then be defined on the Queue object normally, since it’s accessible both to the script and to the website itself.
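For example, mirroring the queue sketch from earlier:

unsafeWindow.Queue.prototype.enqueue = function (item) {
    this.queue[this.rear++] = item;
    this.size++;
};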
Another problem we might run into is that we are trying to extract information from elements that haven’t been loaded yet. We can just wait a few seconds while the page loads before calling the rest of our script:
window.onload = function () {
    // give the dynamically loaded parts of the profile a few extra seconds
    setTimeout(doScrape, 5000);
};
Since our script runs again every time we visit a new profile, we should remember to handle the common case as well as the edge cases (say, the very first run):
if (localStorage.getItem("liQ") === null) {
    console.log("no queue detected in localStorage, creating a new one");
    q = new Queue();
    localStorage.setItem("liQ", JSON.stringify(q));
}
console.log("retrieving queue from local storage");
q = JSON.parse(localStorage.getItem("liQ"));
q = reviver(q);
...
In this example, localStorage.getItem() explicitly returns null when the key doesn’t exist, which is what we use to detect (the lack of) previous sessions.
You might have noticed that we didn’t take any steps to prevent the queue from growing indefinitely. We might want to set a limit on its length and/or cut off the already-used front elements from time to time – a good moment to do this is whenever we load a queue from storage and see that there’s a long consumed segment at the beginning.
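One possible way to trim it, assuming the queue fields sketched earlier, could be:

// drop the already-dequeued prefix and reset the indices
function compactQueue(q) {
    q.queue = q.queue.slice(q.front);
    q.rear -= q.front;
    q.front = 0;
    return q;
}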
However, even if the script crawls thousands of profiles, it will only take up a marginal amount of space (especially compared to whatever is downloaded every time we visit a bloated website like linkedin). To make it even less of an issue, the data is only used locally and never sent anywhere.
I have a few ideas for further improvements, such as making a configurable command line script and running the crawler in a headless chrome instance, so that we don’t have to do anything at all. Then we can schedule it to run every few hours and forget about it. I might look into it sometime in the future.
The entirety of the code is available here: https://gitlab.com/thisandthat/linkedin-crawler
Feel free to submit and discuss the article on your favourite social media site – I’m curious about other people’s opinions and suggestions. If I have any presence there, I’ll try to respond to questions and feedback.