Assignment 3
IMPLEMENTATION
Introduction
In the rapidly evolving digital landscape, accessing and analyzing vast troves of web data has
become imperative for businesses and researchers alike. In real-world scenarios, the need
for scaling web crawling operations is paramount. Whether it’s dynamic pricing analysis for
e-commerce, sentiment analysis of social media trends, or competitive intelligence, the
ability to gather data at scale offers a competitive advantage. Our goal is to guide you
through the development of a Google-inspired distributed web crawler, a powerful tool
capable of efficiently navigating the intricate web of information.
The significance of distributed web crawlers becomes evident when we consider the limitations of traditional, single-node crawling: speed bottlenecks, scalability constraints, and vulnerability to system failures. To effectively harness the wealth of data on the web, we must adopt scalable and resilient solutions.
Ignoring this necessity can result in missed opportunities, incomplete insights, and a loss of
competitive edge. For instance, consider a scenario where a retail business fails to employ a
distributed web crawler to monitor competitor prices in real-time. Without this technology,
they may miss out on adjusting their own prices dynamically to remain competitive,
potentially losing customers to rivals offering better deals.
In the realm of social media marketing, timely analysis of trending topics is crucial. Without
the ability to rapidly gather data from various platforms, a marketing team might miss the
ideal moment to engage with a viral trend, resulting in lost opportunities for brand
exposure.
These examples illustrate how distributed web crawlers are not just convenient tools but
essential assets for staying ahead in the modern digital landscape. They empower
businesses, researchers, and marketers to harness the full potential of the internet, enabling
data-driven decisions and maintaining a competitive edge.
To build our crawler, we rely on the following tools:
Golang, Python, NodeJS: We have chosen these programming languages for their strengths in specific components of the crawler, offering a blend of performance, versatility, and developer-friendly features.
Grafana and Prometheus: These monitoring tools provide real-time visibility into the
performance and health of our crawler, ensuring we stay on top of any issues.
ELK Stack (Elasticsearch, Logstash, Kibana): This trio constitutes our log analysis
toolkit, enabling comprehensive log collection, processing, analysis, and
visualization.
A robust development environment is the foundation of any successful project. Here, we’ll
guide you through setting up the environment for building our distributed web crawler:
1). Install Dependencies: We highly recommend using a Unix-like operating system to install
the packages listed below. For this demonstration, we will use Ubuntu 22.04.3 LTS.
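The exact package list is not reproduced here; as a rough sketch (the package names below are our assumptions for a typical Ubuntu 22.04 setup, not an official list), the toolchain used throughout this guide can be installed along these lines:

sudo apt update
sudo apt install -y golang-go nodejs npm python3 python3-pip docker.io unzip
# AWS CLI v2 (see the official AWS documentation for the current install steps)
curl -o awscliv2.zip "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"
unzip awscliv2.zip && sudo ./aws/install
# kubectl, matching the Kubernetes version of your EKS cluster
sudo snap install kubectl --classic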
2). Configure AWS and set up the EKS cluster: To create a dedicated AWS access key and run aws configure in the terminal of your development machine, please follow the tutorial available here.
aws configure
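After aws configure succeeds, kubectl still needs to be pointed at the remote EKS cluster. Assuming the cluster has already been created (for example by following the tutorial above), a typical way to pull its credentials into your local kubeconfig is:

aws eks update-kubeconfig --region <your-region> --name <your-cluster-name>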
At this point, you can run kubectl get pods to verify that you can successfully connect to the remote cluster. If you encounter a version-conflict error instead, we suggest following this tutorial to debug and resolve the issue.
3). Lens: the most powerful IDE for Kubernetes, allowing you to visually manage your Kubernetes clusters. Once you have it installed on your computer, you should eventually see resource charts like those in the screenshot. However, please note that you will need to install a few components to enable real-time CPU and memory usage monitoring for your cluster.
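One common way to enable those charts is to install the Kubernetes metrics-server (Lens can alternatively install its own Prometheus-based metrics stack); which components you need depends on your cluster, but a typical starting point is:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml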
With the environment in place, let’s look at the main components of our distributed crawler.
1). Worker Nodes: These are the cornerstone of our distributed crawler. We’ll dedicate significant attention to them in the following sections. The Golang crawler will handle straightforward webpages rendered on the server side, while the NodeJS crawler will tackle complex, JavaScript-heavy webpages using a headless browser such as Chrome. It’s important to note that a single HTTP request issued from a language like Golang or Python is significantly more resource-efficient (often 10 times or more) than a request made through a headless browser.
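As a minimal sketch of the Golang side (the function names and the example URL are illustrative, not part of the assignment's codebase), a worker that fetches a server-rendered page with a single lightweight HTTP request might look like this:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// fetch downloads a server-rendered page with a single lightweight HTTP request.
// Illustrative only: a real worker would also handle retries, robots.txt, etc.
func fetch(url string) (string, error) {
    client := &http.Client{Timeout: 10 * time.Second}
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("User-Agent", "distributed-crawler-demo/0.1")

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
    }
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    html, err := fetch("https://example.com")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println("fetched", len(html), "bytes")
}

A NodeJS worker would instead drive a headless browser (for example via Puppeteer) for pages that require JavaScript execution, which is what makes it so much more resource-hungry per request.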
2). Message Queue: For its simplicity and remarkable built-in features, we rely on Redis. The inclusion of Bloom filters stands out here; they are invaluable for filtering duplicates among billions of records while offering high performance and minimal resource consumption.
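As a sketch of how this deduplication might look from a Go worker (this assumes Redis is reachable at localhost:6379 with the RedisBloom module loaded, and the key name crawler:seen_urls is made up for the example), go-redis can drive the Bloom filter directly:

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// seenBefore uses the RedisBloom BF.ADD command: it reports true if the filter
// says the URL probably already exists, false if it was newly added.
func seenBefore(ctx context.Context, rdb *redis.Client, url string) (bool, error) {
    // BF.ADD returns 1 when the item was added, 0 when it may already exist.
    added, err := rdb.Do(ctx, "BF.ADD", "crawler:seen_urls", url).Int()
    if err != nil {
        return false, err
    }
    return added == 0, nil
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    for _, u := range []string{"https://example.com", "https://example.com"} {
        dup, err := seenBefore(ctx, rdb, u)
        if err != nil {
            fmt.Println("redis error:", err)
            return
        }
        fmt.Println(u, "duplicate?", dup)
    }
}

In practice you would first size the filter with BF.RESERVE, choosing a capacity and an acceptable false-positive rate suited to billions of URLs.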
3). Data Storage: A document-oriented database such as MongoDB is a solid choice for storage. However, if you aspire to make your textual data searchable, akin to Google, Elasticsearch is the preferred option.
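As an illustration (the index name, field names, and the unauthenticated localhost:9200 endpoint are assumptions for this sketch), a crawled page can be made searchable by indexing it through Elasticsearch’s REST API from Go:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// Page is an illustrative shape for a crawled document.
type Page struct {
    URL   string `json:"url"`
    Title string `json:"title"`
    Body  string `json:"body"`
}

// indexPage stores one page in Elasticsearch via the _doc endpoint.
func indexPage(esURL, index string, p Page) error {
    payload, err := json.Marshal(p)
    if err != nil {
        return err
    }
    resp, err := http.Post(
        fmt.Sprintf("%s/%s/_doc", esURL, index),
        "application/json",
        bytes.NewReader(payload),
    )
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("elasticsearch returned status %d", resp.StatusCode)
    }
    return nil
}

func main() {
    p := Page{URL: "https://example.com", Title: "Example", Body: "Example Domain"}
    if err := indexPage("http://localhost:9200", "crawled-pages", p); err != nil {
        fmt.Println("index failed:", err)
        return
    }
    fmt.Println("indexed", p.URL)
}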
4). Logging: Within our ecosystem, the ELK stack shines. We deploy a Filebeat worker on each node as a DaemonSet to collect logs and ship them to Elasticsearch via Logstash. This is a critical aspect of any distributed system, as logs play a pivotal role in debugging issues, crashes, and unexpected behaviors.
5). Monitoring: Prometheus takes the lead here, enabling the monitoring of common metrics like CPU and memory usage per pod or node. With a customized metric exporter, we can also monitor metrics related to crawling tasks, such as the real-time status of each crawler, the total number of processed URLs, crawling rates per hour, and more. Moreover, we can set up alerts based on these metrics. Managing a distributed system with numerous instances blindly is not advisable; Prometheus ensures that we have clear insight into our system’s health.
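As a sketch of the customized metric exporter mentioned above (the metric names and the :2112 port are invented for this example), a Go worker can expose crawl counters for Prometheus to scrape using the official client library:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // processedURLs counts every URL the worker has finished crawling.
    processedURLs = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "crawler_processed_urls_total",
        Help: "Total number of URLs processed by this worker.",
    })
    // inFlight tracks how many pages are being fetched right now.
    inFlight = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "crawler_in_flight_requests",
        Help: "Number of in-flight fetches.",
    })
)

func main() {
    prometheus.MustRegister(processedURLs, inFlight)

    // Somewhere in the crawl loop you would call:
    //   inFlight.Inc(); defer inFlight.Dec(); processedURLs.Inc()

    // Expose the metrics endpoint that Prometheus scrapes.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}

Prometheus would then be configured to scrape each worker’s /metrics endpoint, and alerting rules can be defined on top of these series.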