During Google Summer of Code 2016, I worked on ahmia.fi, the hidden service search engine. You are reading the complete report of my work.
The summer was productive, and it has been nice working with the Tor community. I also like that I could take the time to code things the right way.
Do not hesitate to contact me <zma@riseup.net> if you have a question about my work, or my mentor Juha Nurmi <juha.nurmi@ahmia.fi> for anything regarding the Ahmia project.
Remarks about the work done
At the time I'm writing this, my work is usable and no blocking bugs are known.
There are a few things I still need to do, like fixing four pages where the HTML is not valid, updating pages that point to deprecated links, and changing the MIME type of one page from text/html to text/plain.
In this document, I will present the most significant commits I made in each area of the project; I left out one-liners and typo fixes. If you want to see the complete list of my contributions, you can find it in each repository:
- https://github.com/ahmia/ahmia-site/commits/master?author=iriahi (2 pages)
- https://github.com/ahmia/ahmia-crawler/commits/master?author=iriahi (2 pages)
- https://github.com/ahmia/ahmia-index/commits/master?author=iriahi
What work has been done?
Refactor codebase
Repositories
The first thing I did when starting GSoC was to change the project structure.
Back then, it was a single repository which contained everything. My point was that the site and the crawler have very little in common and should be tracked separately. I split the repository in a way that preserved the history, following this guide [1].
I also made a repository for the index, because its structure is what links the crawler and the search site, so it cannot really be part of one repository or the other.
The project now uses three repositories:
- https://github.com/ahmia/ahmia-site
- https://github.com/ahmia/ahmia-crawler
- https://github.com/ahmia/ahmia-index
[1] https://help.github.com/articles/splitting-a-subfolder-out-into-a-new-repository/
Code
The code had some technical debt: deprecated features, references to old back-ends (Solr, django-haystack, YaCy), and duplicated code. I removed the deprecated code, used static analysis to improve code quality, and restructured the monolithic Django code into multiple apps.
The Django code is composed of the following apps:
- ahmia contains root-level static pages, two forms, and the old API endpoints (more about this in the next paragraph)
- search contains the app that queries Elasticsearch and displays results. It is still very simple, but it's good to keep search features self-contained, especially if we implement the search query language parser later.
- stats contains the app that displays statistics
I planned to make an api app, but it would break compatibility. This is explained in more detail in the "Missing work" section.
Data
To improve result relevance, search results need to be ranked, and ranking requires some kind of popularity or "importance" value. Some of these values already existed, but they were stored in a SQL database. Because of this, the ranking logic could not be done by Elasticsearch when making a query, and it also increased the amount of Django code needed to perform a search.
For these reasons, we chose to move all page-related data from SQL to Elasticsearch.
Commits
ahmia/ahmia-site:
7af3f92 Removing old disabled links
7d6284c Centralize the elasticsearch object creation
a5b3343 Base for Search app
fd074e5 Clean ahmia module
40c2af1 Refactor ahmia module
0fa073f Update requirements and fix deprecated stuff
bd07455 Remove admin
0a8e674 Centralize specific code into apps
e239330 Centralize global code into ahmia directory
70cbbc6 Merge pull request #2 from iriahi/master
2b54cae Improve code quality with pylint.
c3fea78 Change django project structure.
dbf12af Using previous directory structure.
ahmia/ahmia-crawler:
8d6be08 Move the filter urls logic (fake and domain format) to the LinkExtractor
0b32511 Fix pylint warnings
05e2196 Modify elasticsearch_servers to respect extension convention
ba767d0 Cleaning and documentation
16bc810 Merge pull request #1 from iriahi/master
e8bd52e Fix variable names
f9a01ce Improve code quality with pylint
85a2fc9 Move index to it's own repository ahmia-index
eb09c20 Move script to ahmia-tools directory
fea60f6 Clean directory structure and rename bot Ahmia
ahmia/ahmia-index:
2954e9e Use files from ahmia-crawler repository
4d6fac3 Initial commit
Automate things
Static analysis
Code correctness and quality can be improved through static analysis. Code quality tracking was added to the project using Landscape.io: it runs pylint each time a push is made to a repository, computes a quality score, and notifies the author by email if the score decreases because of the new code. The score of each Ahmia repository can be seen on its Landscape.io page.
Requirements
Each project's Python requirements are tracked by requires.io. It's really useful for spotting insecure versions of the libraries the project uses and keeping the site secure. The requirement lists can be seen here:
- https://requires.io/github/ahmia/ahmia-crawler/requirements/
- https://requires.io/github/ahmia/ahmia-site/requirements/
Continuous integration
Each time someone pushes code, it is tested by Travis CI. These tests also run pylint to detect anyone pushing "bad quality code".
Commits
ahmia/ahmia-site:
6e05195 Add code health sticker
02f7bcf Update dev requirements to secure versions
b9edb13 Add requires.io badge
6c22709 Add travis.CI sticker to the README
5cfc881 Fix travis.yml syntax
88772d4 Add travis.CI config file
ahmia/ahmia-crawler:
1e3ee4b Add travis badge
84c66a4 Fix format issue
f2bcaac Add travis.yml
81317ac Update outdated requirements
129043e Update requirements
Improve search
Index
Before I started working on this, Elasticsearch was used as a document store, holding each page's title, description, and keywords. Keywords were extracted from the content by the crawler, then passed as an array to the index. But extracting tokens from content is Elasticsearch's job. That's why I improved this aspect a lot, by passing the raw content and indexing it in three different ways, as suggested in the Elasticsearch definitive guide [2].
I also concatenated important fields like title, description, and anchor text (the text used to link to a page, more about this later) into a field called fancy. The reason is that we don't really care where a term appears when we search for something, as long as it appears. This is called term-centric search [3], and in our case we do it at index time.
[2] https://www.elastic.co/guide/en/elasticsearch/guide/current/most-fields.html
[3] https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-all.html
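To make the mapping idea concrete, here is a minimal sketch using the elasticsearch Python client with the old string-type syntax from that era; the field and analyzer names other than fancy are assumptions, not necessarily what ahmia-index uses.

```python
# Minimal mapping sketch; field names other than "fancy" are illustrative
# assumptions, not the exact ahmia-index mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch()

index_body = {
    "mappings": {
        "page": {
            "properties": {
                # Raw content indexed in several ways (multi-fields), so the
                # same text can be matched exactly or through the English stemmer.
                "content": {
                    "type": "string",
                    "analyzer": "standard",
                    "fields": {
                        "english": {"type": "string", "analyzer": "english"}
                    },
                },
                # Term-centric "custom all" field: title, description and
                # anchor text are copied into "fancy" at index time.
                "title": {"type": "string", "copy_to": "fancy"},
                "description": {"type": "string", "copy_to": "fancy"},
                "anchor": {"type": "string", "copy_to": "fancy"},
                "fancy": {"type": "string", "analyzer": "english"},
            }
        }
    }
}

es.indices.create(index="crawl", body=index_body)
```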
Crawl
The spider has been rewritten with the goal of improving search. Here are the new features:
Since we want search results to be pages instead of domains, we use the full URL as the primary key. This introduces the problem of indexing similar pages, like different versions, mirrors, or non-canonical URLs. This was solved at the search step.
The spider now passes raw data to Elasticsearch. Because of this, the parsing code is simpler.
The spider is also responsible for passing link information to Elasticsearch. Let's say we have a page A with this link: <a href="B">Great search engine</a>. We want to index "Great search engine" (this is called anchor text) with page B and not with page A, so a person looking for "search engine" will find B with a better score than A.
The spider is also responsible for computing an authority score for each page (more on this later).
One of the challenges was that the crawler downloaded pages faster than Elasticsearch could index them, filling the heap memory and breaking the crawl process. This was solved by adding a download delay between pages.
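To make the anchor-text and download-delay points above more concrete, here is a rough Scrapy sketch; the class name, item fields, and settings values are illustrative assumptions, not the actual ahmia-crawler code.

```python
# Rough sketch of the anchor-text idea in Scrapy; names and values here are
# made up for illustration, not taken from ahmia-crawler.
import scrapy


class OnionSketchSpider(scrapy.Spider):
    name = "onion_sketch"
    start_urls = ["http://example.onion/"]  # placeholder

    # Slow the crawler down so Elasticsearch indexing can keep up.
    custom_settings = {"DOWNLOAD_DELAY": 0.5}

    def parse(self, response):
        # Index the current page under its full URL (the "primary key").
        yield {"url": response.url, "content": response.text}

        # Attach each anchor text to the *target* page, not the current one.
        for link in response.css("a[href]"):
            target = response.urljoin(link.attrib["href"])
            anchor_text = " ".join(link.css("::text").getall()).strip()
            yield {"url": target, "anchor": anchor_text}
            yield scrapy.Request(target, callback=self.parse)
```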
Search
There are a few things I did to improve search.
First, a query is searched in the fancy field (title + description + anchor text). The page content is not used yet because it gave too many results.
Results are then aggregated by domain name to display only one result per domain. The page displayed is the one with the best authority score, meaning the one with the most quality pages linking to it.
Finally, domains are displayed in order of decreasing authority score.
This algorithm is far from the best, but it's the best I could come up with in the time frame I had. I give several ideas about how to do better at the end of this report.
Fake and banned domains are now filtered out by Elasticsearch. The code is simpler because of this.
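A sketch of this kind of query is shown below; the field names (fancy, domain, authority) follow the description above, but the exact query used in ahmia-site may differ.

```python
# Sketch of the search described above (not the exact ahmia-site query):
# match on the "fancy" field, keep the page with the highest authority per
# domain, and order domains by their best authority score.
from elasticsearch import Elasticsearch

es = Elasticsearch()


def search(user_query):
    body = {
        "size": 0,
        "query": {"match": {"fancy": user_query}},
        "aggs": {
            "domains": {
                "terms": {
                    "field": "domain",
                    "size": 100,
                    "order": {"max_authority": "desc"},
                },
                "aggs": {
                    "best_page": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"authority": {"order": "desc"}}],
                        }
                    },
                    "max_authority": {"max": {"field": "authority"}},
                },
            }
        },
    }
    response = es.search(index="crawl", body=body)
    return [
        bucket["best_page"]["hits"]["hits"][0]
        for bucket in response["aggregations"]["domains"]["buckets"]
    ]
```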
Commits
ahmia/ahmia-site:
4411265 Fix bug when no anchors in a document
d1976e6 Tweak search query to use domains with no authority
ca8424c Correct the validator that checks if a domain is banned
9384eab Don't query documents with no authority score in search
44e8a9a Add API endpoint and tune search
4aba81d Activate search app
7bf388f Update search app to make it work with new index
ahmia/ahmia-index:
ae45b94 New index format
5ac96b4 First test for english-only crawling
ahmia/ahmia-crawler:
e312cd3 Index <body> inner html instead of the whole page
4d41cb6 Speed up pagerank computing
0ce85c3 Increase download delay to avoid overloading ES
97d4f5d Reduce request size sent to ES to improve indexing speed
a96b3a9 Modify what we add to the index when creating a doc
05e2196 Modify elasticsearch_servers to respect extension convention
8c404f8 Add url, domain, updated_on for indexed links
52d9974 Fix itemproc.process_item call
9fa9c6f Fix a bug where doc with no contents caused a crash
655eeb3 Improve crawling efficiancy and reliability
b5ec09a Fix bug where no document was indexed
9764a16 Move popularity_bot code to main bot pipelines
307c410 Add bot to compute pagerank
d2607ed Update crawler to make it work with new index format
Update documentation
Documentation was updated to make sure that anyone who wants to get and run the project can do so by following the instructions. Some outdated items, like a deprecated architecture diagram and the Solr configuration documentation, were also removed.
Commits
ahmia/ahmia-site:
37ebb5d Add docstrings to views and forms
786ac94 Add a FAQ and fix manage.py path
7be7de1 Update README
2bb65a3 Add build toolchain and missing openssl headers
81d8460 Add config examples
0416a0f Update README.md
ahmia/ahmia-crawler:
79baf48 Update README with badges
3ab64fd Add a link to ahmia-index repository
63a61c0 Add a paragraph about TorBalancer in main README
9a7c023 Add a usage paragraph in the README
6f4f540 Update README
b54cabb Create index README
e9ede41 Update onionElasticBot's README
feb5932 Create onionElasticBot README
f22b5cd Use previous name for the bot
dd5f880 Add configuration and run systems to README
bda03dd Fix Java package name
dfbac4d Update README with a coherent install guide
Design
Not much work has been done on the frontend. The only thing worth mentioning is that all pages now use the new design.
Commits
ahmia/ahmia-site:
03f3f93 Using a compiled version of css for people using noscript
a5ff766 Fix domain list presentation
ecde96f Update description proposal to new design
What work is missing?
Statistics
In my initial proposal, one of my goals was to gather more statistics. Not many stats have been added to the ones already collected, and I even dropped the clicks and visits stats. The stats app should be rethought from the ground up in order to be really insightful. I suggested that we could use a self-hosted statistics framework like Piwik [4], but no decision has been made.
I added an authority value to each page, a statistic computed with the PageRank algorithm [5]. I used the python-igraph library's PageRank implementation [6] to compute this score.
[4] https://piwik.org/
[5] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
[6] http://igraph.org/python/doc/igraph.Graph-class.html#pagerank
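For reference, here is a minimal sketch of computing such a score with python-igraph's PageRank implementation; the link data shown here is made up, while the real crawler builds its graph from the indexed pages.

```python
# Minimal PageRank sketch with python-igraph; the link list is illustrative.
import igraph

# Directed link graph as (source_url, target_url) pairs.
links = [
    ("http://a.onion/", "http://b.onion/"),
    ("http://c.onion/", "http://b.onion/"),
    ("http://b.onion/", "http://a.onion/"),
]

graph = igraph.Graph.TupleList(links, directed=True)
scores = graph.pagerank()  # one authority value per vertex

authority = dict(zip(graph.vs["name"], scores))
for url, score in sorted(authority.items(), key=lambda kv: -kv[1]):
    print(url, round(score, 3))
```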
API
One of my goals was also to make a RESTful API. I wanted to make it self-contained in a Django app called 'api'. I did not make it, and left the API endpoints in the root 'ahmia' Django app.
I also removed some API endpoints because they were domain-centric, while we now index pages/URLs. How can I return or update the title of a domain?
I think that a RESTful API is valuable, but it needs to be written in a way that is not backwards-compatible. That's why I did not start it during the Summer of Code.
Query language
Until the very end, I thought I had time to make this happen. The idea was to be able to tune the search query made to Elasticsearch by specifying words with special meanings (a rough parser sketch follows the list):
- -word would generate a "does not match word" filter.
- "multi words query" would generate an AND multi-match query instead of an OR one.
- domain:mydomainsdf23254.onion would add a term filter with domain equal to this value.
- filetype:pdf would look for a PDF file, using the content_type field indexed in Elasticsearch. This feature is not working yet because we only index HTML pages, but thanks to anchor text, we could probably index documents and pictures as well.
- lang:fr would look for pages in French. We could do this either by using the lang field (which is not yet populated by the crawler), or by querying the French-specific index if the "index by language" feature is implemented.
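Here is that rough parser sketch; the token syntax matches the list above, but the returned structure is a made-up example, not a defined Ahmia API.

```python
# Rough sketch of a query-language parser; the returned filter structure is
# an illustrative assumption, not something that exists in ahmia-site.
import re


def parse_query(raw_query):
    """Split a raw query into plain terms, negated terms and field filters."""
    terms, negated, filters = [], [], {}
    # Keep quoted phrases together ("multi words query").
    tokens = re.findall(r'"[^"]+"|\S+', raw_query)
    for token in tokens:
        if token.startswith('"') and token.endswith('"'):
            terms.append(token.strip('"'))      # phrase: AND all of its words
        elif token.startswith("-"):
            negated.append(token[1:])           # -word -> must_not match
        elif ":" in token:
            field, value = token.split(":", 1)  # domain:, filetype:, lang:
            filters[field] = value
        else:
            terms.append(token)
    return {"terms": terms, "negated": negated, "filters": filters}


print(parse_query('tor -scam "search engine" lang:fr'))
# {'terms': ['tor', 'search engine'], 'negated': ['scam'], 'filters': {'lang': 'fr'}}
```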
What could be done in the future?
Index by language
Our index is currently optimized for queries made in English, even when the indexed page is not in English. To solve this, I suggest that we detect the document language during crawling (either by looking for a lang attribute on the html tag, or by using Google's language detection library [7]) and use a different index for each language, as suggested by the Elasticsearch definitive guide [8].
[7] https://pypi.python.org/pypi/langdetect
[8] https://www.elastic.co/guide/en/elasticsearch/guide/current/one-lang-docs.html
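A small sketch of what that routing could look like at crawl time; the index names and the fallback logic are assumptions, not something ahmia-crawler does today.

```python
# Sketch of per-language index routing; index names and fallback are assumptions.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def pick_index(html_lang_attr, text):
    """Choose an index name from the <html lang> attribute or detected language."""
    if html_lang_attr:
        lang = html_lang_attr.split("-")[0].lower()  # "en-US" -> "en"
    else:
        try:
            lang = detect(text)                      # e.g. "fr"
        except LangDetectException:
            lang = "en"                              # fall back to the default index
    return "crawl-%s" % lang


print(pick_index(None, "Ceci est une page en français sur les services cachés."))
# most likely "crawl-fr"
```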
Tuning scoring algorithm
When making a search query, we have two values for each result: _score and authority.
Score is a relevance value computed by Elasticsearch. In our case, a high score means that a page matches the query really well.
Authority is a value computed by the PageRank algorithm. In our case, a high authority means that a page is linked to a lot by other high-authority pages.
The current search algorithm I made doesn't use the score value and only ranks by authority. For a long query like "how to operate a tor hidden service", really relevant results with low authority (like an unknown forum post called "how to operate a hidden service") will be at the bottom of the results list, while a high-authority but less relevant page (like "Tor hidden wiki") will be at the top. This is not good.
One way to solve this is to use boosting. But in order to give great results, we must analyze the score function mathematically to determine how large the boosting multiplier needs to be and what the boosting function should be.
A great result list would start with high relevance / high authority results, then high relevance / low authority and low relevance / high authority; finally, low relevance / low authority results should be displayed.
The Elasticsearch guide gives an example of boosting by popularity [9]. I tried this approach, but it wasn't really good for heterogeneous documents. The score depends on term frequency, field length, etc. In our case, we don't really give more importance to a link called "Tor site" than to "Everything you need to know about Tor", but when searching for "tor", the second will have a lower score because of its length.
This is why we need to analyze scoring (computation method here [10]) and tune it.
[9] https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
[10] https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
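For illustration, here is a sketch of the boosting-by-popularity approach applied to the authority field; the modifier, factor, and field names are assumptions that would still need the mathematical tuning described above.

```python
# Sketch of a function_score query boosting by authority (illustrative values).
query = {
    "query": {
        "function_score": {
            "query": {"match": {"fancy": "tor hidden service"}},
            # Multiply the relevance _score by a dampened authority value,
            # so neither relevance nor authority completely dominates.
            "field_value_factor": {
                "field": "authority",
                "modifier": "log1p",
                "factor": 10,
                "missing": 0,
            },
            "boost_mode": "multiply",
        }
    }
}
```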
Better detect low distance content (or similar pages)
There are a lot of very similar pages in the index. There are several explanations for this:
- The domain could be a mirror of another one
- The domain could be a fake
- The page could be the same as another URL of the same domain, where only one of them is the canonical URL [11]
I tested a simple way to remove similar pages at search time by aggregating them by title and keeping the one with the highest authority. It was simple and it worked, but since I ranked the results by authority, I ended up with too many pages from the same domain at the top of the list.
The right way to solve this is to use the HTML distance between pages [12]. That way, you can also detect pages that have the same content but organized differently (like a list presented in a different order). I believe this has to be done at crawling time.
[11] https://support.google.com/webmasters/answer/139066?hl=en
[12] https://github.com/yahoo/gryffin/tree/master/html-distance
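As a toy illustration of the "distance between pages" idea, a shingle-based Jaccard similarity can flag near-duplicate text; this is a stand-in sketch, not the html-distance algorithm from the gryffin project referenced above.

```python
# Toy near-duplicate detection using shingle-based Jaccard similarity.
def shingles(text, size=3):
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b):
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


page_a = "hidden wiki list of onion services updated daily"
page_b = "hidden wiki list of onion services updated weekly"
print(similarity(page_a, page_b))  # ~0.71, a likely near-duplicate pair
```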
Better understand query string URL parameters
Crawlers have difficulties with URL parameters.
Some of them are useful: http://domain/?page=index or http://domain/blog/?page=3
Some of them are useless: http://domain/?printable=true or http://domain/?format=rss
We could solve this either by trying to guess the meaning of parameters (e.g. page or p is for pagination and changes the content, while format or printable does not), or by using the HTML-distance approach mentioned above.
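A small sketch of the parameter-guessing approach; the list of "useless" parameters here is a made-up example, not an agreed-upon rule.

```python
# Sketch of stripping query parameters that likely don't change page content.
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

USELESS_PARAMS = {"printable", "format", "utm_source"}  # illustrative only


def normalize_url(url):
    """Drop query parameters that are unlikely to change the page content."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in USELESS_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query)))


print(normalize_url("http://domain/?page=3&printable=true"))
# http://domain/?page=3
```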
Improve crawling speed by ignoring unchanged content
There are several ways to achieve this:
- The HTTP header "If-Modified-Since" could be used to avoid parsing unchanged content (a small sketch follows this list).
- The change frequency of a page could be tracked. When was the last time something changed? Over a longer time frame, does it change daily, weekly, monthly, or never? Why crawl something every day if it only changes once a month?
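The If-Modified-Since idea is cheap to sketch with the requests library; how the last-crawl timestamp is stored and looked up is an assumption here.

```python
# Minimal If-Modified-Since sketch; the stored timestamp is an assumption.
from datetime import datetime, timezone
from email.utils import format_datetime

import requests

last_crawled = datetime(2016, 8, 1, tzinfo=timezone.utc)  # e.g. loaded from the index

response = requests.get(
    "http://example.com/",
    headers={"If-Modified-Since": format_datetime(last_crawled, usegmt=True)},
    timeout=30,
)

if response.status_code == 304:
    print("Not modified since last crawl, skip re-indexing")
else:
    print("Content changed, re-index the page")
```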
Understand composed words query
I had a conversation with my mentor in which he told me that one of his tests was to query "duck duck go" in Ahmia. It made me think: the well-known search engine is called "DuckDuckGo". Elasticsearch indexes that as one term and doesn't return it as a result when it sees a query with the terms duck and go.
We need to find a way to handle this. It's doable using what are called n-grams, which are used for two things: search-as-you-type suggestions [13] and compound words [14], like the German word Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (which means "the law concerning the delegation of duties for the supervision of cattle marking and the labeling of beef").
[13] https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
[14] https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
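Here is a sketch of an n-gram analyzer along the lines of the referenced guide chapter; the analyzer and field names, the gram size, and the old string-type syntax are illustrative choices, not the Ahmia mapping.

```python
# Sketch of a trigram analyzer for compound-word matching (illustrative names).
ngram_index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "trigram_filter": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "trigram_filter"],
                }
            },
        }
    },
    "mappings": {
        "page": {
            "properties": {
                # "duckduckgo" becomes the trigrams duc, uck, ckd, kdu, ... so
                # a query for "duck duck go" can still match on this sub-field.
                "title": {
                    "type": "string",
                    "fields": {
                        "ngram": {"type": "string", "analyzer": "trigram_analyzer"}
                    },
                }
            }
        }
    },
}
```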
What now?
I plan to continue working on the project when I have time to do so. Ahmia could be a much greater search engine if we implemented all the ideas we have; only time is missing.