jsPerf.app is an online JavaScript performance benchmark runner & a jsperf.com mirror. It is a complete rewrite, in homage to the once-excellent jsperf.com, with a hopefully more modern & maintainable codebase.
jsperf.com URLs are mirrored at the same path, e.g.:
https://jsperf.com/negative-modulo/2
can be accessed at:
https://jsperf.app/negative-modulo/2
<script>
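// Setup: a long lowercase text blob (scraped article text with punctuation
// stripped) that serves as the search haystack for every test case.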
var str = "web scraping with nodejs ndash smashing magazine menu search jump to the content smashing magazine smashing pages books ebooks tickets shop email newsletter jobs about us impressum categories coding design mobile graphics ux design wordpresswp x search on smashing magazine search x books ebooks tickets shop jobs rss facebook twitter newsletter search on smashing magazine search coding css html javascript techniques design web design typography inspiration business mobile iphone ipad android design patterns graphics photoshop fireworks wallpapers freebies ux design usability user experience ui design ecommerce wordpresswp essentials techniques plugins themes web scraping with nodejs by elliot bonneville april th javascriptnodejs comments advertisement web scraping is the process of programmatically retrieving information from the internet as the volume of data on the web has increased this practice has become increasingly widespread and a number of powerful services have emerged to simplify it unfortunately the majority of them are costly limited or have other disadvantages instead of turning to one of these thirdparty resources you can use nodejs to create a powerful web scraper that is both extremely versatile and completely free in this article ill be covering the following two nodejs modules request and cheerio that simplify web scraping an introductory application that fetches and displays some sample data a more advanced application that finds keywords related to google searches also a few things worth noting before we go on a basic understanding of nodejs is recommended for this article so if you havent already check it out before continuing also web scraping may violate the terms of service for some websites so just make sure youre in the clear there before doing any heavy scraping modules link to bring in the nodejs modules i mentioned earlier well be using npm the node package manager if youve heard of bower its like that except you use npm to install bower npm is a package management utility that is automatically installed alongside nodejs to make the process of using modules as painless as possible by default npm installs the modules in a folder named nodemodules in the directory where you invoke it so make sure to call it in your project folder and without further ado here are the modules well be using request link while nodejs does provide simple methods of downloading data from the internet via http and https interfaces you have to handle them separately to say nothing of redirects and other issues that appear when you start working with web scraping the request module merges these methods abstracts away the difficulties and presents you with a single unified interface for making requests well use this module to download web pages directly into memory to install it run npm install request from your terminal in the directory where your main nodejs file will be located cheerio link cheerio enables you to work with downloaded web data using the same syntax that jquery employs to quote the copy on its home page cheerio is a fast flexible and lean implementation of jquery designed specifically for the server bringing in cheerio enables us to focus on the data we download directly rather than on parsing it to install it run npm install cheerio from your terminal in the directory where your main nodejs file will be located implementation link the code below is a quick little application to nab the temperature from a weather website i popped in my area code at the end of 
the url were downloading but if you want to try it out you can put yours in there just make sure to install the two modules were attempting to require first you can learn how to do that via the links given for them above var request requirerequest cheerio requirecheerio url httpwwwwundergroundcomcgibinfindweathergetforecastquery requesturl function error response body if error var cheerioloadbody temperature datavariabletemperature wxvaluehtml consolelogits temperature degrees fahrenheit else consolelogweve encountered an error error so what are we doing here first were requiring our modules so that we can access them later on then were defining the url we want to download in a variable then we use the request module to download the page at the url specified above via the request function we pass in the url that we want to download and a callback that will handle the results of our request when that data is returned that callback is invoked and passed three variables error response and body if request encounters a problem downloading the web page and cant retrieve the data it will pass a valid error object to the function and the body variable will be null before we begin working with our data well check that there arent any errors if there are well just log them so we can see what went wrong if all is well we pass our data off to cheerio then well be able to handle the data like we would any other web page using standard jquery syntax to find the data we want well have to build a selector that grabs the elements were interested in from the page if you navigate to the url ive used for this example in your browser and start exploring the page with developer tools youll notice that the big green temperature element is the one ive constructed a selector for finally now that weve got ahold of our element its a simple matter of grabbing that data and logging it to the console we can take it plenty of places from here i encourage you to play around and ive summarized the key steps for you below they are as follows in your browser link visit the page you want to scrape in your browser being sure to record its url find the elements you want data from and figure out a jquery selector for them in your code link use request to download the page at your url pass the returned data into cheerio so you can get your jquerylike interface use the selector you wrote earlier to scrape your data from the page going further data mining link more advanced uses of web scraping can often be categorized as data mining the process of downloading a lot of web pages and generating reports based on the data extracted from them nodejs scales well for applications of this nature ive written a small datamining app in nodejs less than a hundred lines to show how wed use the two libraries that i mentioned above in a more complicated implementation the app finds the most popular terms associated with a specific google search by analyzing the text of each of the pages linked to on the first page of google results there are three main phases in this app examine the google search download all of the pages and parse out all the text on each page analyze the text and present the most popular words well take a quick look at the code thats required to make each of these things happen as you might guess not a lot downloading the google search link the first thing well need to do is find out which pages were going to analyze because were going to be looking at pages pulled from a google search we simply find the url for the search we 
want download it and parse the results to find the urls we need to download the page we use request like in the example above and to parse it well use cheerio again heres what the code looks like requesturl function error response body if error consolelogcouldnt get page because of error error return load the body of the page into cheerio so we can traverse the dom var cheerioloadbody links r a linkseachfunction i link get the href attribute of each link var url linkattrhref strip out unnecessary junk url urlreplaceurlq split if urlcharat return this link counts as a result so increment results totalresults in this case the url variable were passing in is a google search for the term data mining as you can see we first make a request to get the contents of the page then we load the contents of the page into cheerio so that we can query the dom for the elements that hold the links to the pertinent results then we loop through the links and strip out some extra url parameters that google inserts for its own usage when were downloading the pages with the request module we dont want any of those extra parameters finally once weve done all that we make sure the url doesnt start with a if so its an internal link to something else of googles and we dont want to try to download it because either the url is malformed for our purposes or even if it isnt malformed it wouldnt be relevant pulling the words from each page link now that we have the urls of our pages we need to pull the words from each page this step consists of doing much the same thing we did just above only in this case the url variable refers to the url of the page that we found and processed in the loop above requesturl function error response body load the page into cheerio var page cheerioloadbody text pagebodytext again we use request and cheerio to download the page and get access to its dom here we use that access to get just the text from the page next well need to clean up the text from the page itll have all sorts of garbage that we dont want on it like a lot of extra white space styling occasionally even the odd bit of json data this is what well need to do compress all white space to single spaces throw away any characters that arent letters or spaces convert everything to lowercase once weve done that we can simply split our text on the spaces and were left with an array that contains all of the rendered words on the page we can then loop through them and add them to our corpus the code to do all that looks like this throw away extra white space and nonalphanumeric characters text textreplacesg replaceazaz g tolowercase split on spaces for a list of all the words on that page and loop through that list textsplit foreachfunction word we dont want to include very short or long words because theyre probably bad data if wordlength return if corpusword if this word is already in our corpus our collection of terms increase the count for appearances of that word by one corpusword else otherwise say that weve found one of that word so far corpusword",
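// Pattern compiled once here in the setup; presumably reused by the
// "Cached RegExp" test case so that compilation cost is paid only once.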
r = new RegExp('count for appearances');
</script>
Ready to run.
| Test | Ops/sec | Status |
|---|---|---|
| regex | | ready |
| indexof | | ready |
| RegExp | | ready |
| Cached RegExp | | ready |
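This snapshot shows only the test names, not the snippet bodies. Given the setup above, here is a plausible sketch of what the four cases measure; the snippet bodies, and the searched phrase `'count for appearances'` (taken from the cached pattern in the setup), are assumptions rather than the page's exact code.

```js
// Hypothetical reconstruction of the four test cases; the live page's
// exact snippets may differ. All search the `str` haystack from the setup.

// "regex": a regex literal, compiled once by the engine
/count for appearances/.test(str);

// "indexof": plain substring search, no regex machinery involved
str.indexOf('count for appearances') !== -1;

// "RegExp": constructs a fresh RegExp object on every iteration
new RegExp('count for appearances').test(str);

// "Cached RegExp": reuses the `r` object built once in the setup
r.test(str);
```

The comparison isolates regex compilation cost: the literal and cached variants pay it once, while the `new RegExp(...)` case pays it on every iteration, and `indexOf` avoids regex machinery entirely.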
You can edit these tests or add more tests to this page by appending /edit to the URL.