During Google Summer of Code 2016, I worked on ahmia.fi, the hidden service search engine. You are reading the complete report of my work.

The summer was productive, and it has been nice working with the Tor community. I also liked that I could take the time to code things the right way.

Do not hesitate to contact me <zma@riseup.net> if you have a question about my work, or my mentor Juha Nurmi <juha.nurmi@ahmia.fi> for anything regarding the Ahmia project.

Remarks about the work done

At the time I’m writing this, the work I have done is usable and no blocking bugs are known.

There are a few things I still need to do, like fixing 4 pages where the HTML is not valid, updating pages that point to deprecated links, and changing the MIME type of one page from text/html to text/plain.

In this document, I will present the most significant commits I made in each area of the project, leaving out one-liners and typo fixes. In case you want to see the complete list of my contributions, you can do so in each repository:

What work has been done?

Refactor codebase

Repositories

The first thing I did when starting GSoC was to change the project structure.

Back then, it was a single repository which contained everything. My point was that the site and the crawler have very little in common and should be tracked separately. I split the repository in a way that kept the history, following this guide [1].

I also made a repository for the index, because its structure is what links the crawler and the search site, so it cannot really be part of one repository or the other.

The project now uses three repositories:

[1] https://help.github.com/articles/splitting-a-subfolder-out-into-a-new-repository/

Code

The code had some technical debt: deprecated features, references to old back-ends (Solr, django-haystack, YaCy), and duplicated code. I removed deprecated code, used static analysis to improve code quality, and changed the structure of the monolithic Django code into multiple apps.

The Django code is composed of the following apps (a short settings sketch follows the list):

  • ahmia contains root-level static pages, two forms, and the old API endpoints (more about this in the next paragraph)
  • search contains the app that queries Elasticsearch and displays results. It is still very simple, but it’s good to self-contain the search features, especially if we implement the search query language parser later.
  • stats contains the app displaying statistics
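
As an illustration, a hypothetical excerpt of the Django settings after this split could look like the following sketch (the actual INSTALLED_APPS in ahmia-site probably lists more entries):

    # Hypothetical excerpt of the Django settings after the split into apps;
    # the real INSTALLED_APPS in ahmia-site likely contains more entries.
    INSTALLED_APPS = [
        "django.contrib.staticfiles",
        "ahmia",    # root-level static pages, forms, legacy API endpoints
        "search",   # Elasticsearch queries and result display
        "stats",    # statistics pages
    ]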

I planned to make an api app, but it would break compatibility. This is explained in more detail in the “What work is missing?” section.

Data

To improve result relevance for a search, results need to be ranked. To rank them, some kind of popularity or “importance” value was needed. Some of these values already existed, but they were stored in an SQL database. Because of this, the ranking logic could not be done by Elasticsearch when making a query. It also increased the amount of code needed on the Django side to perform a search.

For these reasons, we chose to move all page-related data from SQL to Elasticsearch.

Commits

ahmia/ahmia-site:

7af3f92 Removing old disabled links

7d6284c Centralize the elasticsearch object creation

a5b3343 Base for Search app

fd074e5 Clean ahmia module

40c2af1 Refactor ahmia module

0fa073f Update requirements and fix deprecated stuff

bd07455 Remove admin

0a8e674 Centralize specific code into apps

e239330 Centralize global code into ahmia directory

70cbbc6 Merge pull request #2 from iriahi/master

2b54cae Improve code quality with pylint.

c3fea78 Change django project structure.

dbf12af Using previous directory structure.

ahmia/ahmia-crawler:

8d6be08 Move the filter urls logic (fake and domain format) to the LinkExtractor

0b32511 Fix pylint warnings

05e2196 Modify elasticsearch_servers to respect extension convention

ba767d0 Cleaning and documentation

16bc810 Merge pull request #1 from iriahi/master

e8bd52e Fix variable names

f9a01ce Improve code quality with pylint

85a2fc9 Move index to it's own repository ahmia-index

eb09c20 Move script to ahmia-tools directory

fea60f6 Clean directory structure and rename bot Ahmia

ahmia/ahmia-index:

2954e9e Use files from ahmia-crawler repository

4d6fac3 Initial commit

Automate things

Static analysis

Code correctness and quality can be improved thanks to static analysis. Tracking of code quality was added to the project thanks to Landscape.io. It runs pylint each time a push is made to a repository to compute a quality score, and notifies the author by email in case the score decreases because of the new code. The score of each Ahmia repository can be seen here:

Requirements

Each project’s Python requirements are tracked by requires.io. It’s really useful to track insecure versions of the libraries the project might be using, and to keep the site secure. Each requirements list can be seen here:

Continuous integration

Each time someone pushes code, it is tested thanks to Travis CI. These tests also run pylint to detect anyone pushing “bad quality” code.

Commits

ahmia/ahmia-site:

6e05195 Add code health sticker

02f7bcf Update dev requirements to secure versions

b9edb13 Add requires.io badge

6c22709 Add travis.CI sticker to the README

5cfc881 Fix travis.yml syntax

88772d4 Add travis.CI config file

ahmia/ahmia-crawler:

1e3ee4b Add travis badge

84c66a4 Fix format issue

f2bcaac Add travis.yml

81317ac Update outdated requirements

129043e Update requirements

Improve search

Index

Before I started working on this, Elasticsearch was used as a document store, storing pages’ titles, descriptions, and keywords. Keywords were extracted from the content by the crawler, then passed as an array to the index. But it’s Elasticsearch’s role to extract tokens from content. That’s why I improved this aspect a lot, by passing raw content and indexing it in three different ways, as suggested in the Elasticsearch definitive guide [2].

I also concatenated important fields like the title, description, and anchor text (the text used to link to a page, more about this later) into a field called fancy. The reason is that we don’t really care where a term appears when we search for something, as long as it appears. This is called term-centric search [3], and in our case, we do it at index time.

[2] https://www.elastic.co/guide/en/elasticsearch/guide/current/most-fields.html

[3] https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-all.html
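
To make this more concrete, here is a hedged sketch of such a mapping, written as the Python dictionary one would pass to the Elasticsearch client. The field and analyzer names are illustrative, not necessarily the exact ones used in ahmia-index:

    # The raw content is indexed in several ways (multi-fields, as in [2]) and
    # important fields are copied into a catch-all "fancy" field at index time
    # (term-centric search, as in [3]).
    page_mapping = {
        "page": {
            "properties": {
                "content": {
                    "type": "string",
                    "analyzer": "english",
                    "fields": {
                        "std": {"type": "string", "analyzer": "standard"},
                        "raw": {"type": "string", "index": "not_analyzed"},
                    },
                },
                "title":       {"type": "string", "copy_to": "fancy"},
                "description": {"type": "string", "copy_to": "fancy"},
                "anchors":     {"type": "string", "copy_to": "fancy"},
                "fancy":       {"type": "string", "analyzer": "english"},
            }
        }
    }
    # es.indices.put_mapping(index="crawl", doc_type="page", body=page_mapping)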

Crawl

The spider has been rewritten with the goal of improving search. Here are the new features:

Since we want results to return pages instead of domains, we use the full URL as the primary key. This adds the problem of indexing similar pages, like different versions, mirrors, or non-canonical URLs. This was solved at the search step.

The spider now passes raw data to Elasticsearch. Because of this, the parsing code is simpler.

The spider is also responsible for passing link information to Elasticsearch. Let’s say we have a page A with this link: <a href="B">Great search engine</a>. We want to index “Great search engine” (this is called an anchor text) with page B and not with page A, so a person looking for “search engine” will find B with a better score than A.
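
A minimal sketch of that idea, outside of any Scrapy specifics (the helper and variable names below are made up for illustration):

    # Collect anchor texts under the *target* page rather than the page that
    # contains the link, so they end up in the target's indexed document.
    from collections import defaultdict

    anchors = defaultdict(list)   # anchors[target_url] -> list of anchor texts

    def record_link(source_url, target_url, anchor_text):
        """Called for every <a> tag found while parsing source_url."""
        anchors[target_url].append(anchor_text)

    # Page A links to page B with the text "Great search engine":
    record_link("http://a.onion/", "http://b.onion/", "Great search engine")

    # When indexing page B, its document receives the collected anchor texts,
    # so a query for "search engine" matches B rather than A.
    doc_for_b = {"url": "http://b.onion/", "anchors": anchors["http://b.onion/"]}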

The spider is also responsible for computing an authority score for each page (but more on this later).

One of the challenges was that the crawler (downloading pages) was too fast compared to the Elasticsearch indexing speed, filling the heap memory and breaking the crawl process. This was solved by adding a download delay between pages.
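
In Scrapy, this kind of throttling is a simple setting; the values below are only an example, not necessarily what ahmia-crawler uses:

    # Example Scrapy settings to slow the crawler down so Elasticsearch can
    # keep up with indexing; the actual values in ahmia-crawler may differ.
    DOWNLOAD_DELAY = 0.5        # seconds to wait between two requests
    CONCURRENT_REQUESTS = 16    # cap on simultaneous downloads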

Search

There are a few things I did to improve search.

First, a query is searched in the fancy field (title + description + anchor text). Page content is not used yet because it gave too many results.

Results are then aggregated by domain name to display only one result per domain. The page displayed is the one with the best authority score, meaning the one with the most quality pages linking to it.

Finally, domains are displayed in order of decreasing authority score.
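
A hedged sketch of such a query with the elasticsearch-py client is shown below. The field names (fancy, domain, authority) follow this report, while the index name and the exact parameters are placeholders:

    # Aggregate matching pages per domain, keep the page with the highest
    # authority in each domain, and order the domains by that authority.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    body = {
        "query": {"match": {"fancy": "search engine"}},
        "size": 0,
        "aggs": {
            "by_domain": {
                "terms": {
                    "field": "domain",
                    "order": {"best_authority": "desc"},  # rank domains by authority
                },
                "aggs": {
                    "best_authority": {"max": {"field": "authority"}},
                    "best_page": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"authority": {"order": "desc"}}],
                        }
                    },
                },
            }
        },
    }
    results = es.search(index="crawl", body=body)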

This algorithm is far from being the best, but it’s the best I could come up with in the time frame I had. I give several ideas about how to do better at the end of this report.

Fake and banned domains are now filtered by Elasticsearch, which makes the code simpler.

Commits

ahmia/ahmia-site:

4411265 Fix bug when no anchors in a document

d1976e6 Tweak search query to use domains with no authority

ca8424c Correct the validator that checks if a domain is banned

9384eab Don't query documents with no authority score in search

44e8a9a Add API endpoint and tune search

4aba81d Activate search app

7bf388f Update search app to make it work with new index

ahmia/ahmia-index:

ae45b94 New index format

5ac96b4 First test for english-only crawling

ahmia/ahmia-crawler:

e312cd3 Index <body> inner html instead of the whole page

4d41cb6 Speed up pagerank computing

0ce85c3 Increase download delay to avoid overloading ES

97d4f5d Reduce request size sent to ES to improve indexing speed

a96b3a9 Modify what we add to the index when creating a doc

05e2196 Modify elasticsearch_servers to respect extension convention

8c404f8 Add url, domain, updated_on for indexed links

52d9974 Fix itemproc.process_item call

9fa9c6f Fix a bug where doc with no contents caused a crash

655eeb3 Improve crawling efficiancy and reliability

b5ec09a Fix bug where no document was indexed

9764a16 Move popularity_bot code to main bot pipelines

307c410 Add bot to compute pagerank

d2607ed Update crawler to make it work with new index format

Update documentation

Documentation was updated to make sure everyone wanting to get and run the project could do it by following the instructions. Also, some things like a deprecated architecture diagram and the Solr configuration documentation were removed.

Commits

ahmia/ahmia-site:

37ebb5d Add docstrings to views and forms

786ac94 Add a FAQ and fix manage.py path

7be7de1 Update README

2bb65a3 Add build toolchain and missing openssl headers

81d8460 Add config examples

0416a0f Update README.md

ahmia/ahmia-crawler:

79baf48 Update README with badges

3ab64fd Add a link to ahmia-index repository

63a61c0 Add a paragraph about TorBalancer in main README

9a7c023 Add a usage paragraph in the README

6f4f540 Update README

b54cabb Create index README

e9ede41 Update onionElasticBot's README

feb5932 Create onionElasticBot README

f22b5cd Use previous name for the bot

dd5f880 Add configuration and run systems to README

bda03dd Fix Java package name

dfbac4d Update README with a coherent install guide

Design

Not much work has been done in the frontend department. The only thing worth mentioning is that all pages now use the new design.

Commits

ahmia/ahmia-site:

03f3f93 Using a compiled version of css for people using noscript

a5ff766 Fix domain list presentation

ecde96f Update description proposal to new design

What work is missing?

Statistics

In my initial proposal, one of my goals was to gather more statistics. Not many stats have been added to the ones already collected, and I even dropped the clicks and visits stats. The stats app should be rethought from the ground up in order to be really insightful. I suggested that we could use a self-hosted statistics framework like Piwik [4], but no decision has been made.

I added an authority value to each page, which is a stat computed thanks to the PageRank algorithm [5]. In my case, I used the python-igraph library’s PageRank implementation [6] to compute this score.

[4] https://piwik.org/

[5] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

[6] http://igraph.org/python/doc/igraph.Graph-class.html#pagerank
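
For illustration, here is a small sketch of that computation with python-igraph; the URLs and the way the edges are gathered are made up, the real code lives in the crawler pipelines:

    # Compute a PageRank-based authority score per page with python-igraph [6].
    from igraph import Graph

    # (source_page, target_page) pairs collected while crawling
    edges = [
        ("http://a.onion/", "http://b.onion/"),
        ("http://c.onion/", "http://b.onion/"),
        ("http://b.onion/", "http://a.onion/"),
    ]

    graph = Graph.TupleList(edges, directed=True)
    scores = graph.pagerank()                    # one score per vertex

    authority = dict(zip(graph.vs["name"], scores))
    # Each page's authority value can then be stored in its Elasticsearch document.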

API

One of my goals was also to make a RESTful API. I wanted to make it self-contained in a Django app called ‘api’. I did not make it, and left the API endpoints in the root ‘ahmia’ Django app.

I also removed some API endpoints because they were domain-centric while we are now indexing pages/URLs. For example, how could I return or update the title of a domain?

I think that a RESTful API is valuable, but it needs to be written in a way that is not backward-compatible. That’s why I did not start it during the Summer of Code.

Query language

Until the very end, I thought I had time to make this happen. The idea was to be able to tune the search query sent to Elasticsearch by specifying words with a special meaning (a sketch of the resulting query follows the list):

  • -word would generate a “not match word” filter.
  • “multi words query” would generate an AND multi-match query instead of OR.
  • domain:mydomainsdf23254.onion would add a term filter with domain equal to this value.
  • filetype:pdf would look for a PDF file. This would use the content_type field indexed in Elasticsearch. The feature is not working yet because we only index HTML pages, but thanks to the anchor text, we could probably also index documents and pictures.
  • lang:fr would look for pages in French. Either we do this by using the lang field (which is not yet populated by the crawler), or by using the index specific to the French language if the “index by language” feature is implemented.
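
None of this exists yet, but as a rough sketch, the parser could turn those tokens into an Elasticsearch bool query along these lines (the field names follow this report, everything else is hypothetical):

    # Build a bool query from already-parsed tokens: plain words, excluded
    # words (prefixed with '-'), and key:value filters such as domain: or lang:.
    # Translating filetype:pdf into a content_type filter is left out for brevity.
    def build_query(words, excluded, filters):
        must = [{"match": {"fancy": {"query": " ".join(words), "operator": "and"}}}]
        must_not = [{"match": {"fancy": word}} for word in excluded]
        term_filters = [{"term": {field: value}} for field, value in filters.items()]
        return {"query": {"bool": {"must": must,
                                   "must_not": must_not,
                                   "filter": term_filters}}}

    # build_query(["search", "engine"], ["scam"], {"domain": "mydomainsdf23254.onion"})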

What could be done in the future?

Index by language

Our index is currently optimized for queries made in English, even if the indexed page is not in English. To solve this, I suggest that we detect the document language during crawling (either by looking for a lang attribute on the html tag, or by using Google’s language detection library [7]) and use a different index for each language, as suggested by the Elasticsearch definitive guide [8].

[7] https://pypi.python.org/pypi/langdetect

[8] https://www.elastic.co/guide/en/elasticsearch/guide/current/one-lang-docs.html
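
A sketch of the suggested approach with the langdetect library [7]; the per-language index names are placeholders, not existing Ahmia indices:

    # Detect a page's language and pick a per-language index for it.
    from langdetect import detect

    def index_name_for(text, default="crawl-en"):
        try:
            lang = detect(text)      # e.g. "en", "fr", "de"
        except Exception:            # langdetect raises an error on empty or
            return default           # undetectable content
        return "crawl-%s" % lang

    # es.index(index=index_name_for(page_text), doc_type="page", body=doc)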

Tuning scoring algorithm

When making a search query, we have two values for each result: _score and authority.

Score is a value computed by Elasticsearch, linked to relevance. In our case, a high score means that a page matches the query really well.

Authority is a value computed by the PageRank algorithm. In our case, high authority means a page is linked to a lot by other high-authority pages.

The current search algorithm I made doesn’t use the score value but only ranks by authority. For a broad query like “how to operate a tor hidden service”, really relevant results with low authority (like an unknown forum post called “how to operate a hidden service”) will be at the bottom of the results list, while a high-authority but less relevant page (like “Tor hidden wiki”) will be at the top. This is not good.

One way to solve this is to use boosting. But in order to give great results, we must analyze the score function mathematically to calculate how large the boosting multiplier needs to be, and what the function should be.

A great result list would start with high relevance / high authority results, then high relevance / low authority and low relevance / high authority. Finally, low relevance / low authority should be displayed.

The Elasticsearch guide gives an example about boosting by popularity [9]. I tried this approach, but it wasn’t really good for heterogeneous documents. The score depends on term frequency, field size, etc. In our case, we don’t really give more importance to a link called “Tor site” compared to “Everything you need to know about Tor”. But when searching for “tor”, the second will have a lower score because of its length.

This is why we need to analyze scoring (computation method here [10]) and tune it.

[9] https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html

[10] https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
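
For reference, the boosting-by-popularity pattern from [9] applied to our authority field would look roughly like the sketch below; the factor and modifier values are illustrative, and finding good ones is exactly the tuning work described above:

    # Multiply Elasticsearch's relevance _score by a damped function of the
    # page's authority, instead of ranking by authority alone.
    body = {
        "query": {
            "function_score": {
                "query": {"match": {"fancy": "how to operate a tor hidden service"}},
                "field_value_factor": {
                    "field": "authority",
                    "modifier": "log1p",   # dampen very large authority values
                    "factor": 10,          # illustrative, would need tuning
                },
                "boost_mode": "multiply",  # final score = _score * factor(authority)
            }
        }
    }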

Better detect low distance content (or similar pages)

There are a lot of very similar pages in the index. There are several explanations for this:

  • The domain could be a mirror of another one.
  • The domain could be a fake.
  • The page could be the same as another URL of the same domain, with only one of them being the canonical URL [11].

I tested a simple way to remove similar pages at search time by aggregating them by title and using the one with the largest authority. It was simple and worked. But since I ranked the results by authority, I had too many pages of the same domain at the top of the list.

The right way to solve this is to use the HTML distance between pages [12]. That way, you can also detect pages that have the same content but organized differently (like when a list uses a different order). I believe this has to be done at crawl time.

[11] https://support.google.com/webmasters/answer/139066?hl=en

[12] https://github.com/yahoo/gryffin/tree/master/html-distance

Better understand query string url parameters

Crawlers have difficulties with URL parameters.

Some of them are useful: http://domain/?page=index or http://domain/blog/?page=3

Some of them are useless: http://domain/?printable=true or http://domain/?format=rss

We could solve this either by trying to guess the meaning of parameters (e.g. page or p is for pagination and changes the content, while format or printable does not), or by using the HTML distance approach.
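
The parameter-guessing idea could start as something as simple as this sketch; the list of “useless” parameters is a guess, not something Ahmia uses today:

    # Strip query parameters that usually do not change the content before
    # indexing a URL, so near-duplicate URLs collapse into one.
    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    USELESS_PARAMS = {"printable", "format"}

    def normalize_url(url):
        parts = urlparse(url)
        query = [(key, value) for key, value in parse_qsl(parts.query)
                 if key not in USELESS_PARAMS]
        return urlunparse(parts._replace(query=urlencode(query)))

    # normalize_url("http://domain/?printable=true&page=3") -> "http://domain/?page=3"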

Improve crawling speed by ignoring unchanged content

There are several ways to achieve this:

The HTTP header “If-Modified-Since” could be used to avoid parsing unchanged content.

The change frequency of a page could be tracked. When was the last time something changed? On a longer time frame, is it changing daily/weekly/monthly/never? Why crawl every day something that changes once per month?
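
As a rough illustration of the If-Modified-Since idea, using the requests library (the real crawler would do this in a Scrapy downloader middleware, and the Tor proxy configuration is omitted here):

    # Skip re-parsing a page that has not changed since the last crawl.
    import requests

    last_crawl_date = "Sat, 20 Aug 2016 12:00:00 GMT"   # stored from the previous run
    response = requests.get("http://example.com/",
                            headers={"If-Modified-Since": last_crawl_date})
    if response.status_code == 304:
        # 304 Not Modified: keep the existing document in the index untouched.
        pass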

Understand composed words query

I had a conversation with my mentor where he told me that one of his tests was to query “duck duck go” in Ahmia. It made me think: the well-known search engine is called “DuckDuckGo”. Elasticsearch indexes that as one term and doesn’t return it as a result when it sees a request with the terms duck and go.

We need to find a way to handle this. It’s doable by using what we call n-grams, which are used for two things: suggestions-as-you-type [13] and compound words [14], like the word Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz in German (which means “the law concerning the delegation of duties for the supervision of cattle marking and the labeling of beef”).

[13] https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html

[14] https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
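
A hedged sketch of such an analyzer, following the compound-words recipe in [14]; the analyzer and field names are placeholders:

    # Index settings with a trigram analyzer so "duckduckgo" also produces the
    # grams "duc", "uck", "ckd", ... and a query for "duck" can match it.
    index_body = {
        "settings": {
            "analysis": {
                "filter": {
                    "trigrams_filter": {"type": "ngram", "min_gram": 3, "max_gram": 3}
                },
                "analyzer": {
                    "trigrams": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "trigrams_filter"],
                    }
                },
            }
        },
        "mappings": {
            "page": {
                "properties": {
                    "fancy": {"type": "string", "analyzer": "trigrams"}
                }
            }
        },
    }
    # es.indices.create(index="crawl", body=index_body)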

What now?

I plan to continue working on the project when I have time to do so. Ahmia could be a much greater search engine if we implemented all the ideas we have. Only time is missing.
