This project consists of a web application that, given a URL, retrieves the robots.txt file of the website and displays it as an HTML page.
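The core flow is simple: parse the submitted URL, fetch /robots.txt from that host, and render the result in a template. The sketch below only illustrates that flow; the view name, query parameter, and template are illustrative and may not match the actual code in the repo.

```python
# views.py -- simplified sketch of the core flow (names and template are illustrative)
from urllib.parse import urlparse

import requests
from django.shortcuts import render


def robots_view(request):
    """Fetch the robots.txt of the submitted URL and render it as an HTML page."""
    target = request.GET.get("url", "")
    parsed = urlparse(target)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    response = requests.get(robots_url, timeout=5)
    context = {"robots_url": robots_url, "content": response.text}
    return render(request, "robots.html", context)
```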
To run this project you have two options: run it with Docker, or run it using Python.
- Clone the repository:
git clone https://github.com/aamargant/robots.git
- Enter the project directory:
cd robots/
Below is an example of how you can deploy the application using Docker:
- Build the Docker image:
docker build . -t django-robots-v0.0.1
- Run the Docker image:
docker run -it -p 8000:8000 django-robots-v0.0.1
Note: When running with Docker, the logs will state that the development server started at http://0.0.0.0:8000/, but to access it from your local computer you must use http://127.0.0.1:8000/.
Below is an example of how you can run the application using Python:
- Install pipenv:
pip3 install pipenv
- Install dependencies:
pipenv install
- Activate the virtual environment:
pipenv shell
- Run the development server:
pipenv run python manage.py runserver
You can view the application at: http://127.0.0.1:8000/
Below is an example of how you can run tests for this application (a sample test is sketched after the commands):
- With Python:
pipenv run python manage.py test
- With Docker:
docker run -it -p 8000:8000 django-robots-v0.0.1 test
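For reference, a minimal Django test could look like the sketch below. This is only an illustration that assumes the main page is served at the site root; it is not necessarily the test that ships with the project.

```python
# tests.py -- minimal sketch of a Django test (the route is an assumption)
from django.test import TestCase


class RobotsPageTests(TestCase):
    def test_index_page_loads(self):
        # Assumes the main page is served at the site root
        response = self.client.get("/")
        self.assertEqual(response.status_code, 200)
```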
To scale this project up to millions of users per day, we would need different technologies and tools.
First of all, I used the Django framework for this project because it is Python-based and ships with many built-in tools, which makes Django quite a complete framework when having to deal with a lot of users. For this project's use case I have not used most of Django's features because the use case is very simple, but I chose Django with the mindset that we can build on it and improve it over time as more features come regularly (e.g. database, models, caching, admin site, etc.).
Key points to take into account to scale up to millions of users per day:
- Client-side rendering: When a client makes a request to our app, instead of building the HTML on the server side, the server only sends the required data to the client and it is the client's browser that actually generates the HTML page. This frees up resources on the server so it can process more requests and helps the application scale (see the cached JSON endpoint sketch after this list).
- Caching requests: Cache requests that are made frequently, so instead of actually requesting the robots.txt file from the website again we answer from a cache stored in a temporary storage layer. This comes with the challenge of cache invalidation: we would need a system that periodically re-fetches the robots.txt file and checks that it has not changed, or that somehow receives an event from the website when its robots.txt file changes so we can invalidate the cache. Nonetheless, I think robots.txt files do not change frequently (the sketch after this list uses a simple TTL for that reason).
- Asynchronously process requests: When a thread makes a request, instead of waiting for the response it moves on to the next task. This allows the application to handle more requests and improves its responsiveness. For this we would need a message broker between the client and the server so it can track which tasks need to be done (see the Celery sketch after this list).
- Resiliency/Reliability/Availability: In production environments we need resiliency, so that if there is any issue with the application we can instantly fall back to another instance behind a load balancer.
- Monitoring (metrics, logs): We would need to collect different kinds of metrics to gain insights and understand the performance and health status of the application at any given time. An alerting system also needs to be set up so that we are alerted in case of an incident (e.g. Datadog, Prometheus, Grafana, Alertmanager, etc.).
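To make the client-side rendering and caching points concrete, the sketch below shows a view that returns only the data as JSON (so the browser builds the page) and answers repeated requests from Django's cache. It is an illustration under assumptions: the cache backend (e.g. Redis configured in settings), the key scheme, and the 24-hour TTL are not part of the current project.

```python
# views_api.py -- hedged sketch of a cached JSON endpoint (backend, key scheme and TTL are assumptions)
import requests
from django.core.cache import cache
from django.http import JsonResponse

CACHE_TTL = 60 * 60 * 24  # assume robots.txt rarely changes; re-fetch at most once a day


def robots_api(request):
    """Return the robots.txt content as JSON so the client's browser renders the page."""
    # Input validation/normalisation is omitted to keep the sketch short
    robots_url = request.GET.get("url", "").rstrip("/") + "/robots.txt"
    cache_key = f"robots:{robots_url}"

    content = cache.get(cache_key)
    if content is None:
        # Cache miss: fetch from the target website and keep it in the cache layer
        content = requests.get(robots_url, timeout=5).text
        cache.set(cache_key, content, CACHE_TTL)

    return JsonResponse({"url": robots_url, "content": content})
```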
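For the asynchronous processing point, a common pattern with Django is Celery backed by a message broker such as RabbitMQ or Redis. The sketch below only illustrates that pattern; Celery, the broker, and the task name are assumptions, not part of the current project.

```python
# tasks.py -- hedged sketch of offloading the fetch to a Celery worker
# (Celery plus a broker such as RabbitMQ/Redis is assumed; not part of the current project)
import requests
from celery import shared_task


@shared_task
def fetch_robots(robots_url: str) -> str:
    """Fetch robots.txt in a background worker so web threads keep serving requests."""
    return requests.get(robots_url, timeout=5).text


# In a view, the work is queued instead of blocking the request thread, e.g.:
#   result = fetch_robots.delay("https://example.com/robots.txt")
#   # the client can poll for the result (result.id) or be notified later
```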
High-level architecture for this application:
The architecture can be replicated across more regions and zones depending on how resilient/available we want the system to be and, of course, on the budget for this project.
To set up the previous architecture I would use Terraform and Terragrunt to provision the infrastructure (VPCs, subnets, DNS, internet gateways, NAT, etc.). To deploy the actual application I would use Kubernetes with Helm charts and Argo CD. With Kubernetes we have an elastic cluster that shrinks and grows depending on the CPU usage of the compute VMs.
Future improvements for this project:
- Make this app store the fetched robots.txt files in a database (see the model sketch after this list).
- Analyse and extract insights from the robots.txt files of multiple websites.
- Create charts and dashboards with, for example, the most banned user-agents in different regions.
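As a starting point for the first improvement, storing fetched files could be as simple as a Django model like the one below. This is a sketch only; the model and field names are illustrative.

```python
# models.py -- hedged sketch of a model for storing fetched robots.txt files
from django.db import models


class RobotsFile(models.Model):
    """One fetched robots.txt snapshot, kept for later analysis."""
    website = models.URLField()
    content = models.TextField()
    fetched_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return f"{self.website} @ {self.fetched_at:%Y-%m-%d}"
```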
If you have any suggestions that would make this better, fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
- Fork the Project
- Create your Feature Branch (git checkout -b feature/<feature-name>)
- Commit your Changes (git commit -m 'Add some <feature-name>')
- Push to the Branch (git push origin feature/<feature-name>)
- Open a Pull Request to this project