Fandango.com Data Scraping: November 2014

Sunday, 30 November 2014

Web Scraping’s 2013 Review – part 2

As promised, we are back with the second part of this year’s web scraping review. Today we will focus not only on the events of 2013 that concerned web scraping but also on Big Data and what this year meant for the concept.

First of all, we could not talk about the conferences that involved data mining without mentioning the TED conferences. This year the speakers focused on the power of data analysis to help medicine and to prevent possible crises in third world countries. Regarding data mining, everyone agreed that this is one of the best ways to obtain virtual data.

Also, a study by MeriTalk, a government IT networking group, commissioned by NetApp, showed this year that many organizations are not prepared for the information revolution. The survey found that state and local IT pros are struggling to keep up with data demands. Just 59% of state and local agencies are analyzing the data they collect, and fewer than half are using it to make strategic decisions. State and local agencies estimate that they have just 46% of the data storage and access, 42% of the computing power, and 35% of the personnel they need to successfully leverage large data sets.

Some economists argue that it is often difficult to estimate the true value of new technologies, and that Big Data may already be delivering benefits that are uncounted in official economic statistics. Cat videos and television programs on Hulu, for example, produce pleasure for Web surfers — so shouldn’t economists find a way to value such intangible activity, whether or not it moves the needle of the gross domestic product?

We will end this article with some numbers about the enormous growth of data available on the internet. There were 30 billion gigabytes of video, e-mails, Web transactions and business-to-business analytics in 2005. The total is expected to reach more than 20 times that figure in 2013, with off-the-charts increases to follow in the years ahead, according to research conducted by Cisco. As you can see, we have good reason to believe that 2014 will be at least as good as 2013.

Source: http://thewebminer.com/blog/2013/12/

Sunday, 23 November 2014

Outsourcing Data Mining is a Wise Business Decision

Most businesses nowadays have a large volume of raw data that is never processed, because of the lack of time or resources. If your business is facing a similar situation, then you are missing out on valuable information. Without the right information, your company will be unable to make accurate business decisions.

The right information can play a key role in promoting the growth of your business. When unprocessed data is entered, filtered, classified and converted into a workable format, it can be used to maximize your profits, ameliorate your risks and run a seamless workflow.

Over the years, data mining has proved to be extremely useful in various industries, be it healthcare, direct marketing, e-commerce, finance, customer relationship management or telecommunications. With the right information, companies have been able to make fast and effective business decisions.

Why outsource data mining?

Data mining requires the expertise of professional business and financial analysts who understand how to acquire important information from vast amounts of data. If data mining is done in-house, it can become expensive and time-consuming. It can also shift your focus away from core business activities. Outsourcing data mining, on the other hand, is faster and more cost-effective, and can give you access to professional services.

4 commonly outsourced data mining functions

Most companies outsource one or more of the following data mining functions to India:

1. Data congregation: Data is extracted from various web pages and websites, by using methods like web and screen scraping. The collected data is then entered into a database.

2. Contact data collection: Different websites are searched and information concerning contacts is collected.

3. E-commerce data: Data about various online stores is collected, taking into account information about prices, discounts and products.

4. Data about competitors: Data about business competitors is collected to help a company gauge itself against its competition. With such valuable data, you can effectively re-design your marketing strategy and pricing matrix.

8 advantages of outsourcing data mining to India

With data mining out of your hands, your business can make huge savings in terms of time, money and infrastructure. The following are some of the benefits that you can leverage by outsourcing data mining to India:

    Get qualified and highly skilled data mining experts to work for you at an extremely affordable cost

    Be assured of the quality of information, as Indian data entry companies only extract information from reliable websites and databases

    Save on the cost of investing in the latest data mining software and technology, as your Indian service provider will be making these investments

    Get your data processed within a short turnaround time of 3, 6 or 12 hours, as Indian data mining companies can provide efficient data mining within a few hours

    When compared to in-house data mining, outsourcing data mining can be a lot cheaper and also bring you better results

    Stay assured about the complete privacy, security and confidentiality of your valuable data as Indian data mining companies use the latest technology to ensure 100% safety

    Get access to data with a wide market coverage, as your Indian data mining provider will be serving many businesses with varied data mining needs

    Improve your overall productivity and generate more profits by making informed decisions about your business

Have you outsourced data mining before? If yes, which data mining service did you outsource? Did you find outsourcing more advantageous than in-house data mining? Let us know.

Source: http://blog.flatworldsolutions.com/outsourcing-data-mining-is-a-wise-business-decision/

Monday, 17 November 2014

Get started with screenscraping using Google Chrome’s Scraper extension

How do you get information from a website into an Excel spreadsheet? The answer is screenscraping. There are a number of tools and platforms (such as OutWit Hub, Google Docs and Scraper Wiki) that help you do this, but none of them is, in my opinion, as easy to use as the Google Chrome extension Scraper, which has become one of my absolute favourite data tools.

What is a screenscraper?

I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreds, thousands or even more pages, it can be an incredibly powerful tool.

In its most simple form, the one that we will look at in this blog post, it gathers information from one webpage only.

Google Chrome’s Scraper

Scraper is a Google Chrome extension that can be installed for free from the Chrome Web Store.

Now if you installed the extension correctly you should be able to see the option “Scrape similar” if you right-click any element on a webpage.

The Task: Scraping the contact details of all Swedish MPs

This is the site we’ll be working with, a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and choose Scrape similar. This should open the following window.

Understanding XPaths

At w3schools you’ll find a broader introduction to XPaths.

Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define which parts of the webpage we want to collect.

A typical XPath might look something like this:

    //div[@id="content"]/table[1]/tr

Which in plain English translates to:

    // - Search the whole document...

    div[@id="content"] - ...for the div tag with the id "content".

    table[1] - Select the first table.

    tr - And in that table, grab all rows.
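
To see the same path at work outside the browser, here is a minimal sketch using Python's standard library. The page content is made up for illustration, and the snippet assumes well-formed markup; real-world HTML usually needs a more forgiving parser. Note that ElementTree supports only a subset of XPath and wants a relative path, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div id="sidebar">
      <table><tr><td>ignored</td></tr></table>
    </div>
    <div id="content">
      <table>
        <tr><td>row 1</td></tr>
        <tr><td>row 2</td></tr>
      </table>
      <table>
        <tr><td>second table</td></tr>
      </table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Same path as in the text, written relative to the root element:
# any div with id "content" -> its first table -> all rows of that table.
rows = root.findall(".//div[@id='content']/table[1]/tr")

# Pull the text of the first cell in each matched row.
cells = [row.findtext("td") for row in rows]
print(cells)
```

Only the rows of the first table inside the "content" div are matched; the sidebar table and the second table are skipped.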

Over to Scraper then. I’m given the following suggested XPath:

    //section[1]/div/div/div/dl/dt/a

The results look pretty good, but it seems we only get names starting with an A. And we would also like to collect phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.

Right-click one of the MPs and choose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.

And if we open the section tag we find the list of MPs in div tags.

We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.

Writing our XPaths

In step one we want to try to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover over the tags in the Elements window to see which tags correspond to which elements on the page.

In our case this is the last tag that contains all the data we are looking for:

    //section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl

Click Scrape to test run the XPath. It should give you a list that looks something like this.

Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.

I have highlighted the parts that we want to extract. Grab them with the following XPaths:

    name: dt/a
    party: dd[1]
    region: dd[2]/span[1]
    seat: dd[2]/span[2]
    phone: dd[3]
    e-mail: dd[4]/span/a

Insert these paths in the Columns field and click Scrape to run the scraper.
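
The two-step idea (one XPath for the rows, then one relative XPath per column) can also be sketched in code. The following is only an illustration with Python's standard library: the markup is a hypothetical, well-formed stand-in shaped to match the paths above, the people and contact details are invented, and `text_at` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped to match the paths in the text (made-up data).
PAGE = """
<html><body>
<section class="grid_6 alpha omega searchresult container clist">
  <div><div><div>
    <dl>
      <dt><a href="#">Anna Example</a></dt>
      <dd>Party A</dd>
      <dd><span>Region North</span><span>Seat 12</span></dd>
      <dd>08-000 00 01</dd>
      <dd><span><a href="mailto:anna@example.org">anna@example.org</a></span></dd>
    </dl>
    <dl>
      <dt><a href="#">Bo Sample</a></dt>
      <dd>Party B</dd>
      <dd><span>Region South</span><span>Seat 87</span></dd>
      <dd>08-000 00 02</dd>
      <dd><span><a href="mailto:bo@example.org">bo@example.org</a></span></dd>
    </dl>
  </div></div></div>
</section>
</body></html>
"""

# Step one: a single path that selects one node per MP.
ROW_PATH = (".//section[@class='grid_6 alpha omega searchresult container clist']"
            "/div/div/div/dl")

# Step two: paths relative to each row, one per output column.
COLUMNS = {
    "name": "dt/a",
    "party": "dd[1]",
    "region": "dd[2]/span[1]",
    "seat": "dd[2]/span[2]",
    "phone": "dd[3]",
    "e-mail": "dd[4]/span/a",
}


def text_at(row, path):
    """Return the stripped text at a relative path, or '' if it is missing."""
    node = row.find(path)
    return (node.text or "").strip() if node is not None else ""


root = ET.fromstring(PAGE)
records = [
    {column: text_at(row, path) for column, path in COLUMNS.items()}
    for row in root.findall(ROW_PATH)
]

for record in records:
    print(record)
```

Keeping the column paths relative to the row is what stops values from different MPs getting mixed up: each column is looked up inside one dl at a time.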

Click Export to Google Docs to get the data into a spreadsheet.

Source: http://dataist.wordpress.com/2012/10/12/get-started-with-screenscraping-using-google-chromes-scraper-extension/

Saturday, 15 November 2014

Screen-scraping with WWW::Mechanize

Screen-scraping is the process of emulating an interaction with a Web site - not just downloading pages, but filling out forms, navigating around the site, and dealing with the HTML received as a result. As well as for traditional lookups of information - like the example we'll be exploring in this article - we can use screen-scraping to enhance a Web service into doing something the designers hadn't given us the power to do in the first place. Here's an example:

I do my banking online, but get quickly bored with having to go to my bank's site, log in, navigate around to my accounts and check the balance on each of them. One quick Perl module (Finance::Bank::HSBC) later, and now I can loop through each of my accounts and print their balances, all from a shell prompt. Some more code, and I can do something the bank's site doesn't ordinarily let me - I can treat my accounts as a whole instead of individual accounts, and find out how much money I have, could possibly spend, and owe, all in total.

Another step forward would be to schedule a crontab every day to use the HSBC option to download a copy of my transactions in Quicken's QIF format, and use Simon Cozens' Finance::QIF module to interpret the file and run those transactions against a budget, letting me know whether I'm spending too much lately. This takes a simple Web-based system from being merely useful to being automated and bespoke; if you can think of how to write the code, then you can do it. (It's probably wise for me to add the caveat, though, that you should be extremely careful working with banking information programmatically, and even more careful if you're storing your login details in a Perl script somewhere.)

Back to screen-scrapers, and introducing WWW::Mechanize, written by Andy Lester and based on Skud's WWW::Automate. Mechanize allows you to go to a URL and explore the site, following links by name, taking cookies, filling in forms and clicking "submit" buttons. We're also going to use HTML::TokeParser to process the HTML we're given back, which is a process I've written about previously.

The site I've chosen to demonstrate on is the BBC's Radio Times site, which allows users to create a "Diary" for their favorite TV programs, and will tell you whenever any of the programs is showing on any channel. Being a London Perl M[ou]nger, I have an obsession with Buffy the Vampire Slayer. If I tell this to the BBC's site, then it'll tell me when the next episode is, and what the episode name is - so I can check whether it's one I've seen before. I'd have to remember to log into their site every few days to check whether there was a new episode coming along, though. Perl to the rescue! Our script will check to see when the next episode is and let us know, along with the name of the episode being shown.

Here's the code:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;

If you're going to run the script yourself, then you should register with the Radio Times site and create a diary, before giving the e-mail address you used to do so below.

my $email = "";
die "Must provide an e-mail address" unless $email ne "";

We create a WWW::Mechanize object, and tell it the address of the site we'll be working from. The Radio Times' front page has an image link with an ALT text of "My Diary", so we can use that to get to the right section of the site:

my $agent = WWW::Mechanize->new();
$agent->get("http://www.radiotimes.beeb.com/");
$agent->follow("My Diary");

The returned page contains two forms - one to allow you to choose from a list box of program types, and then a login form for the diary function. We tell WWW::Mechanize to use the second form for input. (Something to remember here is that WWW::Mechanize's list of forms, unlike an array in Perl, is indexed starting at 1 rather than 0. Our index is, therefore, 2.)

$agent->form(2);

Now we can fill in our e-mail address for the '<INPUT name="email" type="text">' field, and click the submit button. Nothing too complicated.

$agent->field("email", $email);
$agent->click();

WWW::Mechanize moves us to our diary page. This is the page we need to process to find the date details from. Upon looking at the HTML source for this page, we can see that the HTML we need to work through is something like:

<input>
<tr><td></td></tr>
<tr><td></td><td></td><td class="bluetext">Date of episode</td></tr>
<td></td><td></td>
<td class="bluetext">Time of episode</td></tr>
<a href="page_with_episode_info"></a>

This can be modeled with HTML::TokeParser as below. The important methods to note are get_tag - which will move the stream on to the next opening for the tag given - and get_trimmed_text, which returns the text between the current and given tags. For example, for the HTML code "<b>Bold text here</b>", my $tag = get_trimmed_text("/b") would return "Bold text here" to $tag.

Also note that we're initializing HTML::TokeParser on '\$agent->{content}' - this is an internal variable for WWW::Mechanize, exposing the HTML content of the current page.

my $stream = HTML::TokeParser->new(\$agent->{content});
my $date;
# <input>
$stream->get_tag("input");
# <tr><td></td></tr><tr>
$stream->get_tag("tr"); $stream->get_tag("tr");
# <td></td><td></td>
$stream->get_tag("td"); $stream->get_tag("td");
# <td class="bluetext">Date of episode</td></tr>
my $tag = $stream->get_tag("td");
if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
    $date = $stream->get_trimmed_text("/td");
    # The date contains '&nbsp;' (a non-breaking space), which we'll translate to a space.
    $date =~ s/\xa0/ /g;
}
# <td></td><td></td>
$stream->get_tag("td");
# <td class="bluetext">Time of episode
$tag = $stream->get_tag("td");
# Guard against a missing class attribute, as in the date check above.
if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
    $stream->get_tag("b");
    # This concatenates the time of the showing to the date.
    $date .= ", from " . $stream->get_trimmed_text("/b");
}
# </td></tr><a href="page_with_episode_info"></a>
$tag = $stream->get_tag("a");
# Match the URL to find the page giving episode information.
$tag->[1]{href} =~ m!src=(http://.*?)'!;
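
As an aside for readers who don't use Perl, the same "walk the tags, keep the text of the interesting cells" idea can be sketched with Python's standard html.parser module. This is not a port of the script above: the sample markup is a hypothetical stand-in for the diary page, the class name BlueTextCollector is made up for this example, and instead of counting cells it simply collects every cell whose class is "bluetext".

```python
from html.parser import HTMLParser


class BlueTextCollector(HTMLParser):
    """Collect the trimmed text of every <td class="bluetext"> cell."""

    def __init__(self):
        super().__init__()      # character references are decoded by default
        self.cells = []
        self._buffer = None     # None while we are outside a matching cell

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "bluetext":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # &nbsp; arrives as "\xa0"; turn it into a plain space, then
            # collapse runs of whitespace, much like get_trimmed_text.
            text = "".join(self._buffer).replace("\xa0", " ")
            self.cells.append(" ".join(text.split()))
            self._buffer = None


# Hypothetical stand-in for the two table rows described in the article.
SAMPLE = (
    '<tr><td></td><td></td>'
    '<td class="bluetext">Thursday&nbsp;23&nbsp;January</td></tr>'
    '<tr><td></td><td></td>'
    '<td class="bluetext"><b>6:45pm to 7:30pm</b></td></tr>'
)

collector = BlueTextCollector()
collector.feed(SAMPLE)
collector.close()

date = ", from ".join(collector.cells)
print(date)
```

Matching on the class attribute rather than on the position of each cell is a design choice of this sketch; it tolerates extra empty cells, at the cost of picking up any other "bluetext" cells a real page might contain.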

We have a scalar, $date, containing a string that looks something like "Thursday 23 January, from 6:45pm to 7:30pm.", and we have a URL, in $1, that will tell us more about that episode. We tell WWW::Mechanize to go to the URL:

$agent->get($1);

The navigation we want to perform on this page is far less complex than on the last page, so we can avoid using a TokeParser for it - a regular expression should suffice. The HTML we want to parse looks something like this:

 Episode The Episode Title 

We use a regex delimited with '!' in order to avoid having to escape the slashes present in the HTML, and store any number of alphanumeric characters after some whitespace, all between tags after the Episode header:

$agent->{content} =~ m! Episode \s+?(\w+?) !;

$1 now contains our episode, and all that's left to do is print out what we've found:

my $episode = $1;
print "The next Buffy episode ($episode) is on $date.\n";
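
For comparison, here is a hedged Python sketch of the same capture-group technique. The markup is hypothetical (the real page's tags are not reproduced here), and the class names "label" and "value" are invented for the example. Two differences from the Perl are worth noting: Python needs no alternative delimiter because slashes are not special in its patterns, and the sketch checks that the match succeeded before reading the captured group.

```python
import re

# Hypothetical fragment standing in for the episode page.
SAMPLE = '<td class="label">Episode</td>\n  <td class="value">Gone</td>'

# The Episode header cell, some whitespace, then a run of word characters
# captured from inside the following cell.
EPISODE_RE = re.compile(
    r'<td class="label">Episode</td>\s+<td class="value">(\w+)</td>'
)

match = EPISODE_RE.search(SAMPLE)
episode = match.group(1) if match else None
print(episode)
```

Because \w does not match spaces, a pattern like this only captures single-word titles; a character class such as [^<]+ would be one way to allow longer ones.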

And we're all set. We can run our script from the shell:

$ perl radiotimes.pl

The next Buffy episode (Gone) is Thursday Jan. 23, from 6:45 to 7:30 p.m.

I hope this gives a light-hearted introduction to the usefulness of the modules involved. As a note for your own experiments, WWW::Mechanize supports cookies - in that the requestor is a normal LWP::UserAgent object - but they aren't enabled by default. If you need to support cookies, then your script should call "use HTTP::Cookies; $agent->cookie_jar(HTTP::Cookies->new);" on your agent object in order to enable session-volatile cookies for your own code.

Happy screen-scraping, and may you never miss a Buffy episode again.

Source: http://www.perl.com/pub/2003/01/22/mechanize.html

Thursday, 13 November 2014

Future of Web Scraping

The Internet is large, complex and ever-evolving. Nearly 90% of all the data in the world has been generated over the last two years. In this vast ocean of data, how does one get to the relevant piece of information? This is where web scraping takes over.

Web scrapers attach themselves, like a leech, to this beast and ride the waves by extracting information from websites at will. Granted, “scraping” doesn’t have a lot of positive connotations, yet it happens to be the only way to access data or content from a web site without RSS or an open API.

Future of Web Scraping

Web scraping faces testing times ahead. We outline why there may be some serious challenges to its future.

With the rise in data, redundancies in web scraping are rising too. Web scraping is no longer the domain of coders alone; companies now offer customized scraping tools that clients can use to get the data they want. When everyone is equipped to crawl, scrape, and extract, the outcome is an unnecessary waste of precious manpower. Collaborative scraping could well heal this hurt: one web crawler does a broad scrape while the others pull data from an API. An extension of the problem is that text retrieval attracts more attention than multimedia, and as websites become more complex, this limits scraping capacity.

Easily the biggest challenge to web scraping technology is privacy. With data freely available (most of it voluntary, much of it involuntary), the call for stricter legislation rings loudest. Unintended users can easily target a company and take advantage of the business using web scraping. The disdain with which “do not scrape” policies are treated, and terms of usage violated, tells us that even legal restrictions are not enough. This raises an age-old question: is scraping legal?

The flipside to this argument is that if technological barriers replace legal clauses, then web scraping will see a steady, and sure, decline. This is a distinct possibility since the only way scraping activity thrives is on the grid, and if the very means are taken away and programs no longer have access to website information, then web scraping by itself will be wiped out.

Building the Future

In the same vein is the growing trend of accepting “open data”. The open data policy, while long mused over, hasn’t been used at the scale it should be. The old belief was that closed data gives an edge over competitors. But that mindset is changing. Increasingly, websites are beginning to offer APIs and embrace open data. But what’s the advantage of doing so?

Selling APIs not only brings in money but is also useful in driving traffic back to the sites! APIs are also a more controlled, cleaner way of turning sites into services. Steadily, many successful sites such as Twitter and LinkedIn are offering access to their APIs as paid services and actively blocking scrapers and bots.

Yet, beyond these obvious challenges, there’s a glimmer of hope for web scraping. And this is based on a singular factor: the growing need for data!

With Internet and web technology spreading, massive amounts of data will be accessible on the web, particularly with the increased adoption of mobile internet. According to one report, by 2020 the number of mobile internet users will hit 3.8 billion, or around half of the world’s population!

Since ‘big data’ can be both structured and unstructured, web scraping tools will only get sharper and more incisive. There is fierce competition between those who provide web scraping solutions. With the rise of open source languages like Python, R and Ruby, customized scraping tools will only flourish, bringing in a new wave of data collection and aggregation methods.

Source: https://www.promptcloud.com/blog/Future-of-Web-Scraping

Wednesday, 12 November 2014

3 Reasons to Up Your Web Scraping Game

If you aren’t using a machine-learning-driven intelligent Web scraping solution yet, here are three reasons why you might want to abandon that entry-level Web-scraping software or cut your high-cost script-writing approach.

    You need to keep an eye on a large number of web sources that get updated frequently.

    Understanding what’s changed is at least as critical as the data itself.

    You don’t want maintenance and scheduling to drag you down.

Here’s what an intelligent Web-scraping solution can deliver – and why:

1. Better data monitoring of an ever-shifting Web

If you need to keep a watch over hundreds, thousands or even tens of thousands of sites, an intelligent Web scraper is a must, because:

    It can scale – easily adding new websites, coordinating extraction routines, and automating the normalization of data across different websites.

    It can navigate and extract data from websites efficiently. Script-based approaches can typically view a Web page only in isolation, making it difficult to optimize navigation across unique pages of a targeted site. More intelligent approaches can be trained to bypass unnecessary links and leave a lighter footprint on the sites you need to access. They can also monitor millions of precise Web data points quickly. This means you can monitor more pages on more sites with more frequent updates.

2. Critical alerts to Web data changes

A key sales executive suddenly drops off the management page of your main competitor. That can mean a big shakeup in the entire organization, which your sales team can jump on.

An intelligent Web scraper can alert you to this personnel shift because it can be set to monitor for just the changes; less powerful technologies or script-based approaches can’t. Whether you’re tracking price shifts, people moves, product changes or more, intelligent Web scraping delivers more profound insights.
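
To make “monitoring for just the changes” concrete, here is a deliberately simple sketch that compares two snapshots of extracted records and reports what was added, removed or altered. It does not show how any commercial product works; the record layout, the people and the function names are all assumptions made up for this illustration.

```python
import hashlib


def fingerprint(record: dict) -> str:
    """Hash a record's fields in a stable order so that edits are detectable."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {id: record} snapshots and list the ids that differ."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        key for key in old.keys() & new.keys()
        if fingerprint(old[key]) != fingerprint(new[key])
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of a competitor's management page on two days.
yesterday = {
    "jane-doe": {"title": "VP Sales"},
    "john-roe": {"title": "CTO"},
}
today = {
    "john-roe": {"title": "CEO"},
    "ann-poe": {"title": "CFO"},
}

changes = diff_snapshots(yesterday, today)
print(changes)
```

In this toy data the departure of the sales executive shows up under "removed", which is the kind of signal the scenario above describes. Storing only fingerprints between runs, rather than full records, would be one way to keep such a monitor lightweight.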

3. Maintenance may become your biggest nightmare

You’ve purchased an entry-level tool and built out scrapers for a few hundred sites. At first, everything seems fine. But, within weeks you begin to notice that your data is incomplete and not being updated as you’d expected. Why did your data deliveries disappear?

The reality is that these low-cost tools are simply not designed for mission-critical business applications. On the surface they look helpful and easy to use, but underneath they are script-based and highly dependent upon the HTML of a website. Websites change, and entry-level web scraping tools are simply not engineered to adapt to those changes.

And, most of these tools are simply not designed for enterprise use. They have limited reporting, if any, so the only way to know whether they’re successfully completing their tasks is by finding gaps in the data – often when it’s too late.

An intelligent web scraping approach doesn’t rely upon the HTML of a web page. It uses machine learning algorithms which view the web the same way a user might. A typical reader doesn’t get confused when a font or color is changed on a website, and neither do these algorithms. But simple approaches to web scraping are highly dependent on the specific HTML to help them understand the content of a page. So, when websites have design changes (on average once every 18 months), the software fails.

While entry-level web scraping software can be an easy solution for simple, one-time web scraping projects, the scripts they generate are fragile and the resources required for tracking and maintenance can become overwhelming when you need to regularly extract data from multiple sites.

Case in point: Shopzilla assimilates data five times faster than outsourced Web scrapers

To demonstrate the power of intelligent Web scraping, here’s a real-life example from Shopzilla. Shopzilla manages a premier portfolio of online shopping brands in the United States and Europe, connecting more than 40 million shoppers each month with millions of products from retailers worldwide. With the explosive growth of retail data on the Web, Shopzilla’s outsourced, custom-built approach, based on scripting, could not add the product lines of new retailers to its site in a timely fashion. It was taking up to two weeks to write the scripts needed to make a single site accessible.

By deploying Connotate’s intelligent web scraping platform on site, Shopzilla gained the ability to harness Web data’s rapid growth and keep up to date. Today, new sources are added in days, not weeks. The platform continually monitors Web content from thousands of sites, delivering high volumes of data every day in a structured format. The result: 500 percent more data from new retailers. An added bonus: the company has reduced IT maintenance costs and its dependence on outsourced development timetables.

Case in point: Deep competitor intelligence in two languages

A global manufacturer needed to monitor competitors’ technology improvements in a field where market leadership hinges on an ability to quickly leverage these advances. That meant accessing scholarly journals and niche sites in multiple languages. Using the Connotate solution, it was able to access highly-targeted, keyword-driven university and industry research journals and blogs in German and English that are hard to reach because they do not support RSS feeds. Our solution also incorporated semantic analysis to tag and categorize data and help identify new technologies and products not currently in the keyword list. The firm enhanced its competitive edge with the up-to-the-minute, precise data it needed.

Is your Web scraping intelligent enough?

See what intelligent agents through an automated Web data extraction and monitoring solution can bring to your business. Contact us and speak with one of our experts.

Source: http://www.connotate.com/3-reasons-web-scraping-game-6579#.VGMjH2f4EuQ

Saturday, 8 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotion, testimonials, and evidence of the benefits of online data collection, still only a handful take up the challenge and really gain the payoffs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also seem absurd that many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled. Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/

Wednesday, 5 November 2014

Application of Web Data Mining in CRM

The process of improving customer relations and interactions and making them more amicable may be termed customer relationship management (CRM). Since web data mining applies various modeling and data analysis methods to detect patterns and relationships in data, it can be used as an effective tool in CRM. By effectively using web data mining you are able to understand what your customers want.

It is important to note that web data mining can be used effectively in finding the right potential customers to be offered the right products at the right time. The result of this in any business is an increase in the revenue generated. This is made possible as you are able to respond to each customer in an effective and efficient way. The method also uses very few resources and can therefore be termed economical.

In the next paragraphs we discuss the basic process of customer relationship management and its integration with a web data mining service. The following are the basic steps that should be used in understanding what your customers need, sending them the right offers and products, and reducing the resources used in managing your customers.

Defining the business objective. Web data mining can be used to define your business objective and to inform your customers of it. By doing research you can determine whether your business objective is communicated well to your customers and clients. Does your business objective take an interest in the customers? Your business goal must be clearly outlined in your business CRM. Having a more precise and well-defined goal is the way to ensure success in customer relationship management.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/application-web-data-mining-crm/