How to cite this paper
Gross, Mark, Tammy Bilitzky, Rich Dominelli and Allan Lieberman. “White Hat Web Crawling: Industrial-Strength Web Crawling for Serious Content Acquisition.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Gross01.
Balisage: The Markup Conference 2018
July 31 - August 3, 2018
Balisage Paper: White Hat Web Crawling: Industrial-Strength Web Crawling for Serious Content Acquisition
Mark Gross
President
Data Conversion Laboratory
Mark Gross, President of Data Conversion Laboratory, is a recognized authority on
XML implementation, document conversion, and data mining. Prior to founding DCL in
1981, he was with the consulting practice of Arthur Young & Co. Mark has a BS in Engineering
from Columbia University and an MBA from New York University, and has taught at the
New York University Graduate School of Business, the New School, and Pace University.
Tammy Bilitzky
Chief Information Officer
Data Conversion Laboratory
Tammy Bilitzky is Data Conversion Laboratory (DCL)’s Chief Information Officer. Serving
with DCL since 2013, Tammy is responsible for managing the company’s technology department;
continuing its focus on resilient, high-quality, and innovative products; and helping
to expand the business. She has extensive experience in using technology to deliver
client value, supporting business-process transformation and managing complex, large-scale
programs on and off shore. She holds a BS in computer science and business administration
from Northeastern Illinois University and is a Project Management Professional, Six
Sigma Green Belt, and Certified Scrum Master.
Rich Dominelli
Lead Software Engineer
Data Conversion Laboratory
As Lead Software Engineer, Rich brings over 25 years of System Architecture experience
to Data Conversion Laboratory (DCL). Applying his education from Iona College, the
University of Phoenix, and Stony Brook University, he has been solving problems and
designing resilient solutions on everything from microcontrollers to mobile phones
to state-of-the-art, web-based meter data management systems. Most recently, Rich has
been focusing on creating intelligent, targeted web crawlers.
Allan Lieberman
Special Projects Manager
Data Conversion Laboratory
With a comprehensive technical background in both computer software development and
large scale database design and applications, Special Projects Manager Allan Lieberman
currently oversees Data Conversion Laboratory (DCL)'s efforts in identifying and accessing
legal content on websites worldwide, and provides technical guidance both in-house
and to clients. Allan joined DCL in 2012, following 25 years with the Information
Systems department of Davis Polk & Wardwell, a leading global law firm, where his
most recent position was Manager of Software Design and Systems Development. He holds
a BA in Mathematics from City College of New York, and an MS in Computer Science from
Polytechnic University of New York.
Copyright ©2018 Data Conversion Laboratory, Inc.
Abstract
Much original source material today appears only on the web or with the web version
as the copy of record. We have been developing methods and bots to facilitate high-volume
data retrieval from hundreds of websites, in a variety of source formats (HTML, RTF,
DOCX, TXT, XML, etc.), in both European and Asian languages. We produce a unified
data stream which we then convert into XML for ingestion into derivative databases,
data analytics platforms, and other downstream systems. We will examine the thought
behind our approaches, the analysis techniques we used to detect and deal with website
and content anomalies, our methods to detect meaningful content changes, and our approaches
to verification.
Table of Contents
- Overview
- White Hat versus Black Hat - Good Guys or the Bad Guys?
- Why Do We Need This Information?
- Why All the Fuss?
- It is Not “One Size Fits All”
- One Best Approach? Wishful Thinking….
- Our Methodology - How to Focus on What Matters
  - Does the website use a standard CMS (e.g. Drupal, Joomla or WordPress)?
  - What is the Underlying Technology Stack?
  - What security and authentication are in place?
  - Rules for polite web crawling to avoid being blocked
  - Is there an API/RSS feed available?
  - Does the website have bugs - and how severe?
- Crawler Magic - From Their Website to Ours
  - Page Downloading
  - Page Parsing
  - Metadata Extraction
  - Page Filtering
  - Page Differencing
- No plan survives contact with a webmaster
- What is Next? Beyond 2018….
Overview
Vast amounts of business-critical information appear only on public websites that
are constantly updated to present both new and modified content. While the information
on many of these websites is extremely valuable, no standards exist today for the
way content is organized, presented and formatted, or for how individual websites
are constructed or accessed.
This creates a significant challenge for companies that require data sourced from
these websites in a timely manner, downloaded and structured to support their business
practices and downstream systems.
The paper will focus on specific impediments that we typically encounter and tactics
we’ve adopted to overcome them in the process of creating a streamlined, automated
process to crawl websites, scrape content and metadata, and transform the content
into a standardized XML format. Our comments and recommendations are based on having
successfully traversed hundreds of varied, multilingual, multi-platform, global
websites.
We will elaborate on our methodology and bots used to facilitate high-volume data
retrieval in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in English,
European and Asian languages, with varying organizational approaches.
White Hat versus Black Hat - Good Guys or the Bad Guys?
While Web Crawling sounds somewhat nefarious, there is an important white-hat side
to it. Much original source material today appears only on the Web. For many government
agencies and various NGOs, the web version is the “document of record,” the most
current version available, and where you are referred when you make inquiries regarding
reports, articles, whitepapers, etc.
While there are many tools available to handle the basic crawling and scraping of
websites, they mostly work on one website at a time. Analyzing and traversing volumes
of complex websites - somewhat like developing autonomous vehicles - requires the
ability to adapt to changing conditions, across websites and over time. The presentation
will examine the thought processes behind our approaches, including website analysis,
techniques to detect and deal with website and content anomalies, methods to detect
meaningful content changes, and approaches to verifying results.
Why Do We Need This Information?
There is a vast amount of data available on websites, from informative to entertaining
to legal, with critical content that serves a wide variety of purposes, depending
on the business need. The most common endgame is the normalization, decomposition,
and transformation of the information into a structured format to power derivative
databases, data analytics platforms, and other downstream systems.
Why All the Fuss?
It is estimated that there are 4.52 billion webpages out in the wild (http://www.worldwidewebsize.com/).
Many of these are maintained by webmasters who are certain that their architecture
for running a website is the best one, as opposed to the guy one page over.
Of course, it would be nice if all websites offered a convenient, reliable method
to download and monitor their content, but most do not do so. It would also be helpful
if the different websites complied with standards to make new and modified content
easier to find and extract, but no such luck; the variations are endless and often
at the whim of the developer and content owner. It would also be helpful if once
the website was in place, its design and structure remained static; that does not
happen either. Compounding the problem, in our hack-wary world, some websites restrict or limit
access to deter malicious intruders, at the expense of legitimate users. Finally,
software bugs introduced inadvertently by developers add to the challenge.
It is Not “One Size Fits All”
“We want the content from www.very_important_content.com.” These marching orders launch
our focused functional and technical analysis of each new website.
Understanding the design of each individual website is a prerequisite for successful
crawler automation and content harvesting. Our methodology guides both the website
analysts, and later, the developers, through a series of questions designed to derive
the best approach for each unique website and content set.
Some critical questions to ask are the following:
- How does the website work? Where is the data of interest, and how is it accessed? The possibilities are endless and include traversing menus, sequencing through tables of content, clicking on headlines, and entering search terms.
- How is the website content organized? Date order, subject matter, etc.? Understanding the way a website is organized is critical to locating the content you need and avoiding duplicates.
- What is the website depth? How many links do we need to traverse to access the content? Depending on your business need, you may want to limit your search depth.
- Is all the required metadata available on the website, or does it need to be extracted from the content itself? Metadata is often even more important than the content itself and is needed for validation and search. Getting the metadata from the best source is key.
- How large is the website? Is one crawler able to process it in its entirety? The resulting crawl process must be executed in a timely manner; this is a major consideration for the developer when configuring each website crawl.
- How consistent is the design of the entire website? How large a sampling is required to successfully specify requirements for the crawler automation? Some websites are highly structured and organized. Others have a surprise on every page.
One Best Approach? Wishful Thinking….
One learns quickly that one size does not fit all. It is not feasible to design one
approach to intelligently crawl even a small subset of these websites. Even within
the same department, the web page layout and backend technology often vary, requiring
frequent customizations.
Modern websites have progressed far beyond simple HTML pages to interactive, database-driven
applications that run logic both on the client page and in server-side code. This
forced a transition from NCSA Mosaic (whose last OS X release was about 1.7 MB) to
the current version of Google Chrome, weighing in at a hefty 554 MB at the time of
this writing. This growth in application size reflects the ever-expanding
feature set supported in modern browsers.
Our Methodology - How to Focus on What Matters
In order to attack these problems, we’ve identified a focused series of questions
that guide the developer through the decision process and determine an optimized approach
to extract content and metadata from each specific website. These include:
Does the website use a standard CMS (e.g. Drupal, Joomla or WordPress)?
Consistency is the primary advantage to crawling a website that uses a standard CMS.
The page layouts follow a pattern, and the lists of content are organized with the
same tagging scheme and will often share the same metadata tagging across pages.
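As a simple illustration, many standard CMS installations can be recognized from a generator meta tag or from characteristic asset paths in the markup. The minimal Python sketch below, using the requests and lxml libraries against a hypothetical URL, checks a few such fingerprints; real-world detection combines several signals.

    import requests
    import lxml.html

    URL = "https://www.example.com/"  # hypothetical target website

    def detect_cms(url):
        """Best-effort guess at a standard CMS from common HTML fingerprints."""
        raw_html = requests.get(url, timeout=30).text
        doc = lxml.html.fromstring(raw_html)

        # Many CMSes advertise themselves in a <meta name="generator"> tag.
        generator = doc.xpath('string(//meta[@name="generator"]/@content)').lower()
        for cms in ("wordpress", "drupal", "joomla"):
            if cms in generator:
                return cms

        # Fall back to characteristic asset paths in the raw markup.
        if "/wp-content/" in raw_html:
            return "wordpress"
        if "/sites/default/files/" in raw_html:
            return "drupal"
        return "unknown"

    print(detect_cms(URL))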
What is the Underlying Technology Stack?
If the website is hosted using ASP.NET Web Forms, paging and navigation are typically
implemented as form posts. If it is an Angular website, it may make heavy use of
Ajax or a Single Page Application (SPA) paradigm. The actual URL holding the content
may not be immediately obvious, requiring emulation of a JavaScript-enabled browser
or monitoring requests in the browser’s HTML debugging tools to see how the data is
being sourced. A similar situation occurs when a website makes heavy use of frames;
often the actual content URL is not the URL in the address bar.
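A quick heuristic at this stage is to look at the raw response for markers of the underlying stack before building anything: ASP.NET Web Forms pages carry a __VIEWSTATE hidden field, Angular pages often carry ng-app or ng-version attributes, and some servers identify themselves in response headers. The rough Python sketch below, with a hypothetical URL, illustrates the idea; it is not a substitute for inspecting the site in the browser’s debugging tools.

    import requests

    URL = "https://www.example.com/"  # hypothetical target website

    resp = requests.get(URL, timeout=30)
    raw_html = resp.text

    hints = []
    # Server-side hints from the response headers.
    for header in ("Server", "X-Powered-By", "X-AspNet-Version"):
        if header in resp.headers:
            hints.append(f"{header}: {resp.headers[header]}")

    # Client-side hints from the raw markup.
    if "__VIEWSTATE" in raw_html:
        hints.append("ASP.NET Web Forms (paging likely via form posts)")
    if "ng-app" in raw_html or "ng-version" in raw_html:
        hints.append("Angular (content may load via Ajax; a raw GET may be incomplete)")

    print("\n".join(hints) or "no obvious stack markers")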
What security and authentication are in place?
Does the website require a logon? Does it require cookies or other headers that accompany
the call and must be maintained between calls? The fastest way to crawl a website
is to connect to the specific web address (URI) and retrieve the response, using the
HTTP GET command, and then stream the results to a file. If the interaction between
the server and browser is complex, it is unlikely that this approach will work.
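Where a logon and persistent cookies are required, a session object that carries cookies and headers across requests is usually sufficient. Here is a minimal sketch using the Python requests library; the login URL and form field names are hypothetical placeholders.

    import requests

    LOGIN_URL = "https://www.example.com/login"      # hypothetical
    CONTENT_URL = "https://www.example.com/reports"  # hypothetical

    with requests.Session() as session:
        # Some sites reject the default library User-Agent string.
        session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"})

        # Authenticate once; the session keeps the returned cookies for later calls.
        session.post(LOGIN_URL, data={"username": "user", "password": "secret"}, timeout=30)

        # Subsequent GETs automatically carry the session cookies.
        response = session.get(CONTENT_URL, timeout=30)
        with open("reports.html", "w", encoding="utf-8") as out:
            out.write(response.text)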
Rules for polite web crawling to avoid being blocked
The difference between a DDoS attack and an aggressive crawler is slim. It is a fairly
simple task to write a web crawler which spawns many threads, all simultaneously grabbing
content from a given website to quickly extract all the content. However, this method
will quickly get your IP address blacklisted and block you from the website. A preferred
method is to minimize the simultaneous connections and insert artificial pauses between
the requests, mimicking normal user browser behavior. Even so, some websites will
limit the number of files you can download, in a given day, from the same IP address.
To avoid this, you either have to request files from multiple addresses or hook into
the Tor network to use a different IP address on every request.
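A minimal sketch of such polite behavior, with hypothetical URLs and illustrative delay values: a single connection, a robots.txt check before each fetch, and randomized pauses between requests.

    import random
    import time
    from urllib import robotparser

    import requests

    BASE = "https://www.example.com"                  # hypothetical site
    URLS = [f"{BASE}/docs/{i}" for i in range(1, 6)]  # pages to fetch
    USER_AGENT = "example-crawler/1.0"

    # Respect the site's stated crawling rules.
    robots = robotparser.RobotFileParser()
    robots.set_url(f"{BASE}/robots.txt")
    robots.read()

    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT

    for url in URLS:
        if not robots.can_fetch(USER_AGENT, url):
            continue
        response = session.get(url, timeout=30)
        print(url, response.status_code)
        # Randomized pause to mimic a human reader and avoid tripping rate limits.
        time.sleep(random.uniform(2.0, 6.0))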
Is there an API/RSS feed available?
Some websites, including a few Federal ones, have a clean API available that allows
you to pull the data in via a simple REST or SOAP call. Others expose their content
via RSS (Really Simple Syndication), eliminating the need to parse the HTML pages.
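Because RSS is plain XML, a feed can be consumed with nothing more than the standard library. A short sketch, with a hypothetical feed URL:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://www.example.com/feed.rss"  # hypothetical RSS feed

    with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
        root = ET.fromstring(resp.read())

    # Standard RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        pub_date = item.findtext("pubDate", default="")
        print(pub_date, title, link)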
Does the website have bugs - and how severe?
Bugs can range from simple broken links and unavailable images to flawed paging logic
that only manifests itself when you are well into the development of a crawl. In some
cases, webmasters are responsive and will address, or at least acknowledge, the flaws
in their website - but often you simply have to find a way to work around them.
Crawler Magic - From Their Website to Ours
Rarely are two websites alike. A viable crawl solution must accommodate the unique
aspects of a website without starting from scratch each time we face a new nuance.
Our toolchain approach, in which a set of components is assembled into a crawler,
is our preferred method for crawling large numbers of diverse websites in an efficient,
timely manner and has proven very effective.
Some of the components we configure in our toolchain approach include:
Page Downloading
At its core, a web crawler is a mechanism for bulk downloading pages. The simplest
mechanism is an HTTP GET, the HTTP command to access a URI and retrieve a response.
This only returns the full page for simpler websites, but has a tremendous speed advantage
and is our default mechanism. For sites that require cookies, we supplement the HTTP
GET accordingly.
Pages are often loaded or changed dynamically by client-side scripts. Sections of
text may be appended, deleted, or expanded. As our ultimate goal is downloading the
complete contents of the page, we may need to emulate a browser.
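One common way to emulate a browser (a sketch of the general technique with a hypothetical URL, not a description of our specific toolchain) is to drive a headless browser and save the rendered DOM:

    # Requires: pip install selenium, plus a matching chromedriver.
    import time

    from selenium import webdriver

    URL = "https://www.example.com/dynamic-report"  # hypothetical script-heavy page

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(URL)
        # Crude pause to let Ajax calls settle; explicit waits are more robust.
        time.sleep(5)
        # page_source is the DOM after client-side scripts have run,
        # not the raw HTML an HTTP GET would return.
        with open("rendered.html", "w", encoding="utf-8") as out:
            out.write(driver.page_source)
    finally:
        driver.quit()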
Page Parsing
Parsing will grab elements from within the page and intelligently process them. There
are several common approaches for selecting and navigating elements within web pages.
- CSS selectors are commonly used by many JavaScript tools to quickly grab HTML elements and act on them. However, many elements have no class, lack a distinct identifier, or repeat frequently.
- Some pages rely on unique identifiers for the elements in question, but often only uniquely tag those elements they are interested in manipulating via CSS or JavaScript.
- Many developer tools, e.g. Firebug and the Chrome Developer Tools, let you query via XPath and interactively preview your result, providing a more robust query language to quickly filter and navigate between elements.
The strength of using XPath is based on the similarities between HTML and XML. HTML
is relatively unstructured compared to XML and may not be well-formed. Thankfully,
most languages have a forgiving parser that allows you to treat HTML as if it were
XML. These parsers support a generic, XPath-based mechanism for narrowing the relevant
elements of a page and walking the elements of the page for metadata extraction and
more complex filtering.
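For example, Python's lxml provides such a forgiving HTML parser with full XPath 1.0 support. The fragment and class names below are invented for illustration; in practice they come from inspecting each website.

    import lxml.html

    # A small, slightly untidy HTML fragment standing in for a downloaded page.
    raw_html = """
    <html><body>
      <div class="article">
        <h1>Quarterly Report</h1>
        <span class="byline">Jane Doe</span>
        <p>First paragraph.<p>Second paragraph.
      </div>
    </body></html>
    """

    # The HTML parser tolerates the unclosed <p> tags and still builds a tree.
    doc = lxml.html.fromstring(raw_html)

    title = doc.xpath('string(//div[@class="article"]/h1)')
    author = doc.xpath('string(//span[@class="byline"])').strip()
    paragraphs = [p.text_content().strip() for p in doc.xpath('//div[@class="article"]//p')]

    print(title, author, paragraphs)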
Metadata Extraction
In addition to the HTML documents, we are usually required to extract metadata from
index or other pages. By walking the elements surrounding the link that led us to a
page, much as one walks up and down the document object model, we can extract the
metadata associated with that page.
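A minimal sketch of that idea, against invented index-page markup: locate each link of interest, walk up to its enclosing row, and read the neighboring cells for metadata.

    import lxml.html

    # Invented index page: each row pairs a link with a date and a category.
    index_html = """
    <table>
      <tr><td class="date">2018-06-01</td><td class="cat">Ruling</td>
          <td><a href="/docs/1001.html">Case 1001</a></td></tr>
      <tr><td class="date">2018-06-02</td><td class="cat">Order</td>
          <td><a href="/docs/1002.html">Case 1002</a></td></tr>
    </table>
    """

    doc = lxml.html.fromstring(index_html)

    records = []
    for link in doc.xpath("//a[starts-with(@href, '/docs/')]"):
        # Walk up from the link to the table row that contains it...
        row = link.getparent().getparent()
        # ...then read the sibling cells that describe the linked document.
        records.append({
            "url": link.get("href"),
            "title": link.text_content().strip(),
            "date": row.xpath('string(td[@class="date"])'),
            "category": row.xpath('string(td[@class="cat"])'),
        })

    print(records)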
Page Filtering
There are several options for filtering pages in a crawl (see the sketch after this list):
- Limit the section of the page examined for links using XPath.
- Examine the link itself for keywords that indicate that the content is not in scope or duplicated.
- A final option is to apply logical filters, e.g. filtering out historic versions of a page.
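The sketch below combines the three filters, with an invented XPath scope, keyword list, and URL pattern.

    import re

    import lxml.html

    # Invented filter configuration for one hypothetical site.
    LINK_SCOPE_XPATH = '//div[@id="results"]//a/@href'    # 1. only the results pane
    EXCLUDE_KEYWORDS = ("login", "print-version", "rss")  # 2. out-of-scope or duplicate links
    HISTORIC_PATTERN = re.compile(r"/archive/\d{4}/")     # 3. logical filter: historic versions

    def select_links(page_html):
        """Return only the links on a page that survive all three filters."""
        doc = lxml.html.fromstring(page_html)
        links = []
        for href in doc.xpath(LINK_SCOPE_XPATH):
            lowered = href.lower()
            if any(keyword in lowered for keyword in EXCLUDE_KEYWORDS):
                continue
            if HISTORIC_PATTERN.search(href):
                continue
            links.append(href)
        return links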
Page Differencing
Once you advance beyond simple file comparison, determining whether a page has changed
on a website is a complex task, often requiring a multi-step process (a sketch follows the list):
- Isolate only those areas of interest on the page.
- Strip tags that do not affect the meaning of the page, such as head elements, style tags, JavaScript, and attributes within the tag.
- Assess whether the difference is material. Switching from straight quotes to curly quotes, or from normal spaces to non-breaking spaces, is not usually meaningful. Other changes are more subtle, such as paragraph transitions from preformatted text (<pre>) to lines contained within a paragraph or lines split by breaks - with no actual text differences.
- Apply intelligence to chunk sentences and sentence fragments and compare each word.
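A simplified sketch of these steps: scope the page with XPath, strip markup that carries no meaning, normalize characters whose substitution is not material, and only then compare word by word. The XPath and the normalization table are illustrative assumptions.

    import difflib

    from lxml import etree, html

    CONTENT_XPATH = '//div[@id="main"]'  # illustrative: the area of interest on this site

    # Characters whose substitution is not a meaningful change.
    NORMALIZE = str.maketrans({"\u201c": '"', "\u201d": '"',
                               "\u2018": "'", "\u2019": "'",
                               "\u00a0": " "})

    def significant_text(page_html):
        """Reduce a page to the words that matter for change detection."""
        doc = html.fromstring(page_html)
        region = doc.xpath(CONTENT_XPATH)
        node = region[0] if region else doc
        # Drop markup that never affects meaning.
        etree.strip_elements(node, "script", "style", with_tail=False)
        text = node.text_content().translate(NORMALIZE)
        # Collapse whitespace so <pre>-to-<p> reflows compare as equal.
        return " ".join(text.split())

    def material_diff(old_html, new_html):
        """Word-level differences in the meaningful text; empty when nothing material changed."""
        old_words = significant_text(old_html).split()
        new_words = significant_text(new_html).split()
        return [d for d in difflib.ndiff(old_words, new_words) if d[:1] in "+-"]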
No plan survives contact with a webmaster
Sites change: pages are updated with new character sets, update-notification pages
are frequently wrong, and links die or are changed. We couple our automated crawling
with automated validation to ensure that we have all the required files and metadata.
When we find discrepancies, alerts are issued, and our website analysts often reach
out to the webmasters. Whether we will then get a response or a resolution is uncertain,
so we often have to implement workarounds.
What is Next? Beyond 2018….
Over the past two and a half years, we have developed a series of best practices for
web crawling and harvesting technologies, achieving fully automated processing against
a wide range of diverse, complex and often poorly structured websites. Our methodology
has been iteratively refined to accommodate the ever-changing landscape of internet
content and facilitate a model of continuous improvement.
We are far from done. While not there yet, we are well on our way to eliminating manual
intervention and further automating website analysis, greatly reducing the manual effort
needed to research and resolve problems. We are starting to leverage our volumes of
data to create training sets for machine-learning-based troubleshooting and information
extraction, and this work is already demonstrating significant potential.
Our current road map includes utilizing TensorFlow, NLP, and supervised machine learning
to classify sections of text, extract references and metadata, and supplement our
quality control, all targeted at improving the consistency and reliability of our results
- and doing it faster and better.