Feature #8640

Bug #7161: Support more than 24 HTTP mirrors

Have the mirror pool dispatcher library audited

Added by intrigeri almost 4 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Infrastructure
Target version:
Start date:
05/10/2016
Due date:
% Done:

100%

QA Check:
Pass
Feature Branch:
451f:mirror-pool-dispatcher/master
Type of work:
Communicate
Blueprint:
Starter:
Affected tool:

Subtasks

Bug #11405: Test mirror pool dispatcher script in different browsers Resolved


Related issues

Blocked by Tails - Feature #8639: Write a mirror pool dispatcher script Resolved 01/09/2015 04/15/2016

History

#1 Updated by intrigeri almost 4 years ago

  • Blocked by Feature #8639: Write a mirror pool dispatcher script added

#2 Updated by intrigeri almost 4 years ago

  • Blocks Feature #8641: Have the mirror pool dispatcher script deployed on tails.b.o added

#3 Updated by Dr_Whax almost 4 years ago

I can help out with this, maybe jvoisin is also interested in this.

#4 Updated by intrigeri almost 4 years ago

I can help out with this, maybe jvoisin is also interested in this.

Cool. Would be good to find someone who has a good knowledge of potential security pitfalls in PHP.

#6 Updated by intrigeri over 3 years ago

  • Target version changed from Sustainability_M1 to Tails_2.2

#7 Updated by intrigeri about 3 years ago

  • Due date set to 04/15/2016

#8 Updated by u about 3 years ago

  • Subject changed from Have the server-side mirror pool dispatcher script audited to Have the mirror pool dispatcher script audited

We decided to implement this feature differently, changing the title accordingly.

#9 Updated by intrigeri over 2 years ago

  • Blocks deleted (Feature #8641: Have the mirror pool dispatcher script deployed on tails.b.o)

#10 Updated by u over 2 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

#13 Updated by geb over 2 years ago

Hi,

The code looks great to me. If I may, I would like to make a few suggestions; most (if not all) of them are cosmetic.

  • max_weight, url_fallback and mirrors.json could be global variables written in CAPS at the top of the code, for readability.
  • getRandomMirrorUrlPrefix() could be split into two functions: one that generates the new array, one that selects from it.
  • get(path) could maybe be avoided by using <script src="mirrors.json"> and a mirrors = { } within this file.
  • Maybe it would have been simpler to use an onclick= strategy instead of rewriting the DOM, something like:
<a href="http://dl.amnesia.boum.org/tails/stable/tails-i386-1.6/tails-i386-1.6.iso" onclick="return mirror();">

function mirror() {
  document.location = SelectMirror();
  return false;
}

Please keep in mind that these notes are just suggestions; I know I am a bit late, and anyway the code looks fine. Thanks for your work on it :)

#14 Updated by intrigeri over 2 years ago

The code looks great to me. If I may, I would like to make a few suggestions; most (if not all) of them are cosmetic.

Thanks! Let's follow up on #8639, since this ticket (#8640) is about a security audit.

#15 Updated by co6 over 2 years ago

Hi,
here are the results of my security audit:

The entry point for external data is the JSON file retrieved through a GET XMLHttpRequest.

  • XMLHttpRequest() gets a file with a hardcoded relative path on the same origin: /example-mirrors.json
    => this means it relies on the security set-up on the server side (SSL/TLS is set)
  • In case of any unexpected error, the fallback URL is hardcoded to a secure location.
    => ok
Attack scenarios:
  • if the JSON file is hosted on the same origin as the HTML/JS code retrieving it, it's likely that the server/directory is compromised, so there is no point in validating the JSON data since the HTML code can be modified directly.
  • However, if it is NOT hosted on the same origin as the HTML/JS code retrieving it, or if somehow an attacker exploits a vulnerability of the website allowing them to inject JSON data:
    - we should be protected against code injection: JSON.parse() should protect from code injection (escaping HTML and JS);
    - there is no data validation of url_prefix: are they secure (HTTPS)? are they even valid URLs?
    => see recommendations
    - protecting against denial of service: we need to check the size of the JSON file in order to avoid unbounded "for" loops (number of url_prefixes).
    => see recommendations
Recommendations:
  • checking JSON data is served securely via TLS
  • validating url_prefix values:
    • check it is not an empty string
    • HTTPS: more a protection against non-malicious mistakes rather than from an attacker since an attacker can inject a malicious https:// URL directly
    • URL values: idem
  • checking maximum boundaries for data.mirrors.length and linkelements.length
Misc:
  • Error handling: if the 'href' attribute is empty (i.e. if url_current is an empty string), the url_prefix is inserted and the page provides an incomplete URL for downloading.
    => should it render an empty link instead?
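
Sketched in JavaScript, the recommended validations might look roughly like this (purely illustrative: the function names, the MAX_MIRRORS bound and the URL pattern below are invented for the sketch, not taken from the audited branch):

```javascript
// Illustrative sketch only: names, bound and pattern are assumptions,
// not the actual code under audit.
var MAX_MIRRORS = 100;
var URL_PATTERN = /^https?:\/\/[a-z0-9._-]+\/?[a-z0-9._\/-]*$/i;

function isAcceptableMirror(url_prefix) {
  // Reject empty strings and anything that does not look like a URL.
  if (typeof url_prefix !== "string" || url_prefix.length === 0) {
    return false;
  }
  return URL_PATTERN.test(url_prefix);
}

function validateMirrors(data) {
  // Bound the number of entries so we never loop over an attacker-sized array.
  if (!Array.isArray(data.mirrors) || data.mirrors.length > MAX_MIRRORS) {
    return [];
  }
  return data.mirrors.filter(function (m) {
    return isAcceptableMirror(m.url_prefix);
  });
}
```

Note the pattern is anchored at both ends, so trailing garbage (such as injected markup) cannot ride along on an otherwise valid-looking prefix.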

#16 Updated by u over 2 years ago

  • % Done changed from 0 to 10
  • Feature Branch set to 451f:mirror-pool-dispatcher/master

I've modified the script quite a lot to take into account all those recommendations.

Once I've tried this with DAVE too I'll ask for a new review.

#17 Updated by u over 2 years ago

  • Assignee changed from u to intrigeri
  • QA Check set to Ready for QA

I'd be glad if you could have a first view over those modifications, intrigeri. When done, either reassign this to me or to co6, depending on your judgement.

As said over another channel, the problem for me at the moment remains the RegExp in isValidURL(), which has one false positive for URLs of this type: `https://bla.com/stable/thingy.iso<script>alert('bla');</script>`

#18 Updated by u over 2 years ago

co6 wrote:

here are the results of my security audit:

thank you so much again! your instructions are very clear :)

Recommendations:
  • checking JSON data is served securely via TLS

Doing this via the HTTP headers of the XMLHttpRequest.

  • validating url_prefix values:
    • check it is not an empty string

done

  • HTTPS: more a protection against non-malicious mistakes rather than from an attacker since an attacker can inject a malicious https:// URL directly

I did not implement this because not all mirrors will provide HTTPS. However, for our Firefox extension, where we will use parts of this script too, we will need to check that there are no previous downloads over HTTPS which will get resumed over HTTP.

  • URL values: idem

done

  • checking maximum boundaries for data.mirrors.length and linkelements.length

done by adding a maximum expected filesize for the JSON and a max expected number of links for linkelements.

Misc:
  • Error handling: if the 'href' attribute is empty (i.e. if url_current is an empty string), the url_prefix is inserted and the page provides an incomplete URL for downloading.
    => should it render an empty link instead?

In that case the link stays empty and unmodified. I did the same with links which do not appear to be valid URLs.

#19 Updated by intrigeri over 2 years ago

Attack scenarios:
  • if the JSON file is hosted on the same origin as the HTML/JS code retrieving it, it's likely that the server/directory is compromised, so there is no point in validating the JSON data since the HTML code can be modified directly.

FWIW, in the initial deployment, we'll be in that situation, but we already have plans to maybe change this some day (e.g. to allow a different set of people to manage our pool of mirrors than the ones who have commit access to the Tails website), so I think it's worth it to consider the other case:

  • However, if it is NOT hosted on the same origin as the HTML/JS code retrieving it or if somehow an attacker exploits a vulnerability of the website allowing them to inject JSON data:
    [...]
    => see recommendations

... so, thanks for looking at this other case too, and thanks for the recommendations!

#20 Updated by intrigeri over 2 years ago

However, for our Firefox extension, where we will use parts of this script too, we will need to check that there are no previous downloads over HTTPS which will get resumed over HTTP.

Good catch! If that's not tracked on #11109 yet, then I suggest creating a subtask of that ticket to make sure we don't forget about it.

#21 Updated by intrigeri over 2 years ago

  • Assignee changed from intrigeri to u
  • Target version changed from Tails_2.3 to Tails_2.4
  • QA Check changed from Ready for QA to Dev Needed

I'd be glad if you could have a first view over those modifications, intrigeri.

Sure!

@co6: thanks a lot for the professional-looking audit! This kind of input is very valuable here, and to be honest we rarely, if ever, got anything like this (scary, yes). I think you should wait for u to act on my review below before you take another look at the code, so I'll let her notify you when she feels it's time. I think we'll try to deploy all this in May.

Preamble: I'm totally a newbie at JavaScript, so it's somewhat futile that I attempt code reviews in this language ⇒ please take all that follows with a grain of salt. Now, I'm happy to learn so I will be glad if you tell me why I'm wrong, if you can+want to :)

  • It's good to see this piece of code trust less whatever data it is fed with. Great job! :)
  • I've merged your branch into the "official" repo, so that your work is visible publicly (fixing your own Git repo is WIP, I've reached out to the server admins about it today).
  • I think that 1c7bc1d1a2c3cb811623e7eb01e5bc7459a60080 is incorrect. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/assign for a good explanation of what Object.assign does.
  • Maybe the url_pattern regexp should be anchored at the end (with $ or equivalent) as well?
  • The part of 43b2cf85ec62afc3849c1979fb33f3ebc3b7724a that modifies isValidURL is not needed, given the regexp won't presumably allow empty strings anyway. I see you have a test case for this, so this should be easy to check :)
  • Regarding 28305c62d73bc4ead1ad7c56ed54438ba7621d19: it's good, but I'm worried about what happens if filesize > max_expected_filesize. We'll be doing reject(Error(request.statusText)), but then the error won't make sense (best case) as the request itself will have succeeded, right? I suggest we handle separately the non-200 error code, and the "too big retrieved content" situations, so that we can handle each problematic case sensibly.
  • Regarding ce3a9f41b48f12f0ffe5bdd0e1083318aaaad6fa: I agree we should not return false if we hit a badly formed link, but perhaps it would be nice to log something whenever this happens, instead of silently skipping it? This might be helpful to debug broken links we add to our website.
  • Regarding 5e21fc787f43b9f1dd3feebc63485326c8867871 and 59da88159e6e7604cbf1e5375336c276574144dc: maybe log something?
  • Regarding 90c15c1274dff0190bca6e4e71cf3614866069cf: the short form if statement (without curly braces) is very error-prone, see e.g. http://www.dwheeler.com/essays/apple-goto-fail.html. I strongly recommend against using it, in any language that provides this form, including JS.
  • How about splitting test cases between the links that are supposed to be modified, and the other ones? (Just a random idea, if you think it's wrong or not worth it, don't even bother explaining me why :)
  • I'm not sure about 941f4e91959677eb5c37848957176405fda6abdc: the whole point of the exercise is to protect against the web server; so I'd rather not assume that something will force it to serve a response body that is the same size as the Content-Length header it provided us. In short, with this commit it feels like the check becomes "are you nasty?" and we take a "no" for a valid answer without checking further. Did I get it wrong, or should we just revert this commit to be on the safe side? (And then, 7a5dca92282f38dba1ea377d9e036da17af036f5 might need to be reverted or adjusted as well.)
  • About e6d00e72d4c17364d5aecb2be733151b7d51aaeb: just curious, why?
  • About 517ccb6e648aed058e1892f63187551ce0bf61ef: so we're doing the regexp matching operation twice, once for logging and once for returning. Maybe factorize these two calls?
  • Regarding "checking JSON data is served securely via TLS": is checking that we see a Strict-Transport-Security header really good enough? Presumably, a web server could send this header over plaintext HTTP (that's valid, although clients are supposed to ignore such headers over HTTP iirc). I hope there's another, easy way to check what scheme was used to complete a request.

Cheers!

#22 Updated by u over 2 years ago

intrigeri wrote:

I'd be glad if you could have a first view over those modifications, intrigeri.

Sure!

@co6: thanks a lot for the professional-looking audit! This kind of input is very valuable here, and to be honest we rarely, if ever, got anything like this (scary, yes). I think you should wait for u to act on my review below before you take another look at the code, so I'll let her notify you when she feels it's time. I think we'll try to deploy all this in May.

Preamble: I'm totally a newbie at JavaScript, so it's somewhat futile that I attempt code reviews in this language ⇒ please take all that follows with a grain of salt. Now, I'm happy to learn so I will be glad if you tell me why I'm wrong, if you can+want to :)

  • It's good to see this piece of code trust less whatever data it is fed with. Great job! :)
  • I've merged your branch into the "official" repo, so that your work is visible publicly (fixing your own Git repo is WIP, I've reached out to the server admins about it today).
  • I think that 1c7bc1d1a2c3cb811623e7eb01e5bc7459a60080 is incorrect. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/assign for a good explanation of what Object.assign does.

Modified this comment in 7ac24f5

  • Maybe the url_pattern regexp should be anchored at the end (with $ or equivalent) as well?

I am not sure I understand your suggestion.

  • The part of 43b2cf85ec62afc3849c1979fb33f3ebc3b7724a that modifies isValidURL is not needed, given the regexp won't presumably allow empty strings anyway. I see you have a test case for this, so this should be easy to check :)

correct. fixed in cc157d7

  • Regarding 28305c62d73bc4ead1ad7c56ed54438ba7621d19: it's good, but I'm worried about what happens if filesize > max_expected_filesize. We'll be doing reject(Error(request.statusText)), but then the error won't make sense (best case) as the request itself will have succeeded, right? I suggest we handle separately the non-200 error code, and the "too big retrieved content" situations, so that we can handle each problematic case sensibly.

ack. see 24117d0

  • Regarding ce3a9f41b48f12f0ffe5bdd0e1083318aaaad6fa: I agree we should not return false if we hit a badly formed link, but perhaps it would be nice to log something whenever this happens, instead of silently skipping it? This might be helpful to debug broken links we add to our website.

ack done in 59ce7ea

  • Regarding 5e21fc787f43b9f1dd3feebc63485326c8867871 and 59da88159e6e7604cbf1e5375336c276574144dc: maybe log something?

Done in a8c945e

  • Regarding 90c15c1274dff0190bca6e4e71cf3614866069cf: the short form if statement (without curly braces) is very error-prone, see e.g. http://www.dwheeler.com/essays/apple-goto-fail.html. I strongly recommend against using it, in any language that provides this form, including JS.

ack. modified already through another commit.

  • How about splitting test cases between the links that are supposed to be modified, and the other ones? (Just a random idea, if you think it's wrong or not worth it, don't even bother explaining me why :)

Sure. done in 15f4d23

  • About e6d00e72d4c17364d5aecb2be733151b7d51aaeb: just curious, why?

As the comment says: "To enable console logging in production set DEBUG to true". In a production environment you likely do not want to have console logging enabled and this piece of code can turn it on and off.

  • About 517ccb6e648aed058e1892f63187551ce0bf61ef: so we're doing the regexp matching operation twice, once for logging and once for returning. Maybe factorize these two calls?

done in 82d0c41

Left to correct:
----------------

  • Regarding "checking JSON data is served securely via TLS": is checking that we see a Strict-Transport-Security header really good enough? Presumably, a web server could send this header over plaintext HTTP (that's valid, although clients are supposed to ignore such headers over HTTP iirc). I hope there's another, easy way to check what scheme was used to complete a request.

I agree, and I wanted to make this verification better too.

  • I'm not sure about 941f4e91959677eb5c37848957176405fda6abdc: the whole point of the exercise is to protect against the web server; so I'd rather not assume that something will force it to serve a response body that is the same size as the Content-Length header it provided us. In short, with this commit it feels like the check becomes "are you nasty?" and we take a "no" for a valid answer without checking further. Did I get it wrong, or should we just revert this commit to be on the safe side? (And then, 7a5dca92282f38dba1ea377d9e036da17af036f5 might need to be reverted or adjusted as well.)

Totally agreed. I'm completely aware of the HTTP header injection possibility. However I need to find another method to retrieve the filesize correctly.

#23 Updated by u over 2 years ago

  • Assignee changed from u to intrigeri
  • QA Check changed from Dev Needed to Ready for QA
  • Regarding "checking JSON data is served securely via TLS": is checking that we see a Strict-Transport-Security header really good enough? Presumably, a web server could send this header over plaintext HTTP (that's valid, although clients are supposed to ignore such headers over HTTP iirc). I hope there's another, easy way to check what scheme was used to complete a request.

I've done it in a different way now, simply by looking at the retrieved file path. This could be improved by checking if the domain is tails.boum.org, what do you think?

  • I'm not sure about 941f4e91959677eb5c37848957176405fda6abdc: the whole point of the exercise is to protect against the web server; so I'd rather not assume that something will force it to serve a response body that is the same size as the Content-Length header it provided us. In short, with this commit it feels like the check becomes "are you nasty?" and we take a "no" for a valid answer without checking further. Did I get it wrong, or should we just revert this commit to be on the safe side? (And then, 7a5dca92282f38dba1ea377d9e036da17af036f5 might need to be reverted or adjusted as well.)

Totally agreed. I'm completely aware of the HTTP header injection possibility. However I need to find another method to retrieve the filesize correctly.

For that one I have not yet found a solution. The only way to check the filesize seems to be the HTTP header. Apparently it's not even reliable when servers serve gzipped files. I've not deleted this piece of code yet, though, because I'd prefer we find another way to do it.

#24 Updated by intrigeri over 2 years ago

  • Blocks Feature #8642: Enable the mirror pool dispatcher on all website pages that need it added

#26 Updated by intrigeri over 2 years ago

hey,

Great work!

(Below I'm snipping basically everything since I'm fully happy with the way you handled it :)

  • Maybe the url_pattern regexp should be anchored at the end (with $ or equivalent) as well?

I am not sure I understand your suggestion.

Without $ at the end of the regexp, it will allow the string to contain any non-matching char at the end. If we want the full string to match the regexp, and not merely a prefix of the string to match the regexp, then we need to specify that the string can't contain anything that the regexp does not match. One way to do it is to add $ at the end of the regexp.
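
For instance, using the false-positive URL mentioned earlier in this ticket (the patterns below are simplified stand-ins, not the actual regexp from the branch):

```javascript
// Illustrative, simplified patterns: an unanchored regexp accepts
// trailing garbage, an anchored one does not.
var unanchored = /^https?:\/\/[a-z0-9.-]+\/[a-z0-9.\/-]*/;
var anchored   = /^https?:\/\/[a-z0-9.-]+\/[a-z0-9.\/-]*$/;

var url = "https://bla.com/stable/thingy.iso<script>alert('bla');</script>";

unanchored.test(url); // true: only a prefix of the string has to match
anchored.test(url);   // false: the trailing "<script>..." cannot match
```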

#27 Updated by intrigeri over 2 years ago

  • Assignee changed from intrigeri to u
  • QA Check changed from Ready for QA to Dev Needed
  • Regarding "checking JSON data is served securely via TLS": is checking that we see a Strict-Transport-Security header really good enough? Presumably, a web server could send this header over plaintext HTTP (that's valid, although clients are supposed to ignore such headers over HTTP iirc). I hope there's another, easy way to check what scheme was used to complete a request.

I've done it in a different way now, simply by looking at the retrieved file path. This could be improved by checking if the domain is tails.boum.org, what do you think?

This looks mostly good to me!

The only problem I can think of is that this only checks "the final URL obtained after any redirects", so if there's a cleartext HTTP URL in the sequence of redirects from the initially requested URL but the final one is HTTPS, then we won't catch it, although the overall security level of the request will be HTTP. Too bad XMLHttpRequest doesn't seem to allow us to specify the protocols we are OK with, AFAICT (I'm spoiled with LWP apparently ;) ⇒ with my non-existing JS skills I see no way to do that in a safer way. So at this point, I think we should ask co6 if it's good enough or not, and if not, if she has any suggestion wrt. how to deal with it. I'll let you handle this part of the auditing process.
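
A minimal sketch of that check, assuming the final URL is read from XMLHttpRequest's responseURL property (the function name is invented here; as noted above, responseURL only reflects the URL after any redirects, so a cleartext intermediate hop would not be caught):

```javascript
// Sketch: verify the scheme of the final URL of the request.
// Caveat: intermediate redirect hops are not visible here.
function wasServedOverHTTPS(request) {
  return typeof request.responseURL === "string" &&
         request.responseURL.indexOf("https://") === 0;
}
```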

  • I'm not sure about 941f4e91959677eb5c37848957176405fda6abdc: [...]

Totally agreed. I'm completely aware of the HTTP header injection possibility. [...]

For that one I have not yet found a solution. The only way to check the filesize seems to be the HTTP header. Apparently it's not even reliable, when servers serve gzipped files.

How about checking the length of the response text (https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/response) before we pass it to JSON.parse? We probably can't do that in our get function, and have to do it on the consumer's side instead, which feels slightly wrong in terms of responsibility splitting, but it's the way it has to be with a check done on the output of an async request, I guess.

While we're at it: it seems that the response can possibly have multiple types (https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/responseType), so perhaps we should check that it indeed has the type we expect (presumably "text") before passing it to JSON.parse?
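
A rough consumer-side sketch of both checks suggested above (the function name and the max_expected_filelength value are illustrative, not from the branch):

```javascript
// Illustrative consumer-side validation; name and limit are assumptions.
var max_expected_filelength = 5000;

function parseMirrors(request) {
  // Only a text response makes sense for JSON ("" is the default responseType).
  if (request.responseType !== "" && request.responseType !== "text") {
    throw new Error("unexpected response type: " + request.responseType);
  }
  // Bound what we hand to JSON.parse, to limit the impact of a huge response.
  if (request.responseText.length > max_expected_filelength) {
    throw new Error("retrieved content too big");
  }
  return JSON.parse(request.responseText);
}
```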

#28 Updated by u over 2 years ago

  • Assignee changed from u to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

Please see updated branch for implementation of the proposed modifications. I am quite happy now :)

#29 Updated by intrigeri over 2 years ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to u
  • % Done changed from 10 to 30
  • QA Check changed from Ready for QA to Info Needed

u wrote:

Please see updated branch for implementation of the proposed modifications. I am quite happy now :)

Cool. Everything looks great. Remaining questions:

  • "But not the string length but the number of lines in that file." <-- why? This feels a bit overly complicated to me, compared to simply checking the length of the string, but perhaps I missed the advantage this approach has?
  • Why do we do request.responseType != "" twice?
  • Wrt. commit 0e2d0b6:
    • Given the complexity of the regexp, and the absence of comments, "Improve URL validation regexp" is a bit scarce a commit message. Next time, please clarify what exact problem this solves, so it's easier for me to evaluate the corresponding solution :) Too bad JS apparently provides no "x" regexp flag, which would allow us to add whitespace and inline comments.
    • Why do we need the "g" flag?
    • I fail to understand what good the added parentheses around (http|https):\/\/ do. Keep in mind that every pair of capturing parentheses has some performance impact.

#30 Updated by u over 2 years ago

  • Assignee changed from u to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:

u wrote:

Please see updated branch for implementation of the proposed modifications. I am quite happy now :)

Cool. Everything looks great. Remaining questions:

  • "But not the string length but the number of lines in that file." <-- why? This feels a bit overly complicated to me, compared to simply checking the length of the string, but perhaps I missed the advantage this approach has?

Earlier today you said to me 'rephrased: check that the length you get matches what you can see with "wc mirrors.json"'. So that's how I did it. Depending on the length of domain names, this string could be quite long or quite short, we don't know. It's not a reliable measure in either case, simply a means to not get anything nasty.

  • Why do we do request.responseType != "" twice?

That's a mistake, corrected now.

  • Wrt. commit 0e2d0b6:
    • Given the complexity of the regexp, and the absence of comments, "Improve URL validation regexp" is a bit scarce a commit message. Next time, please clarify what exact problem this solves, so it's easier for me to evaluate the corresponding solution :)

(I do notice that smiley in the end :))
I agree that this commit message was not very detailed.

Too bad JS apparently provides no "x" regexp flag, which would allow us to add whitespace and inline comments.
  • Why do we need the "g" flag?

I think we actually don't. It's "global".

  • I fail to understand what good the added parentheses around (http|https):\/\/ do. Keep in mind that every pair of capturing parentheses has some performance impact.

I tried this regexp in a tool which validates JS regexps (https://github.com/gskinner/regexr/) and that's the suggestion I got. My initial regexp did not work in there.

By the way, I believe I should test the whole thing in some horrible Windows browsers too. Creating a subticket for that.

#31 Updated by intrigeri over 2 years ago

  • Assignee changed from intrigeri to u
  • QA Check changed from Ready for QA to Dev Needed

Hi!

intrigeri wrote:

u wrote:

  • "But not the string length but the number of lines in that file." <-- why? This feels a bit overly complicated to me, compared to simply checking the length of the string, but perhaps I missed the advantage this approach has?

Earlier today you said to me 'rephrased: check that the length you get matches what you can see with "wc mirrors.json"'. So that's how I did it.

I've probably added more confusion than useful information, then (after doing the same with my mistake about whether we can check the response's body length in the get function or not). Sorry about that!

What I really meant was: when checking the size of the downloaded JSON file as requested, during your initial dev process, be extra careful that you're actually checking the data's size (as opposed to the size of some Promise object), by comparing the size you get in your code, with the size of the mirrors.json file, as seen (e.g. with wc -c) if you independently download it e.g. with wget. I'm providing an example below.

Depending on the length of domain names, this string could be quite long or quite short, we don't know. It's not a reliable measure in either case, simply a means to not get anything nasty.

I'm sorry I might have been unclear. I'll sit down and discuss what the original problem to address was, and what the current code does.

The original problem was to "check the size of the JSON file", to protect against DoS. My understanding is that the value to check is request.responseText.length, and I guess it can be compared with max_expected_filelength without involving any RegExp, e.g. with (untested):

--- a/lib/js/mirror-dispatcher.js
+++ b/lib/js/mirror-dispatcher.js
@@ -43,7 +43,7 @@ function get(path) {
     request.onload = function() {
       // console.log(request.getAllResponseHeaders());
       // check filesize is not too big
-      if(request.responseText.match(new RegExp(/{|}|\[|\]|weight|url_prefix/,"g")).length > max_expected_filelength) {
+      if(request.responseText.length > max_expected_filelength) {
         reject(Error("Retrieved content too big"));
       }

... which feels much simpler, faster, and reliable. But maybe request.responseText is special and has no length property that's suitable for our use case? (I really wonder if I missed something, given my poor knowledge of JS.)

I tested that, and the wc thingie I mentioned above, in isolation (with a regular string, not whatever request.responseText is):

#! /usr/bin/nodejs

var dispatcher = require('mirror-dispatcher.js');
var fs = require('fs');
var path = require('path');

var mirrors_json_file = process.argv[2];

fs.readFile(mirrors_json_file, {encoding: 'utf-8'}, function(err, data) {
  if (!err) {
    console.log(data.length);
  } else {
    console.log(err);
  }
});

Testing the above script (you would instead do that with a debug statement in get, to print request.responseText.length):

$ NODE_PATH=lib/js bin/test ../mirror-pool/mirrors.json
3621

Verifying with wc:

$ wc -c ../mirror-pool/mirrors.json
3621 ../mirror-pool/mirrors.json

See what I mean?

OTOH, the current code does something entirely different. If I got it right, it counts, in the downloaded JSON data, the number of occurrences of a curly bracket, a square bracket, the "weight" string, or the "url_prefix" string (and even a legit file has other, non-matching content, but whatever, those are mere details at this point). I see two problems with this approach:

  • I don't understand how it relates to the original problem: the downloaded JSON data can very well be very, very big, and trigger the DoS we're trying to protect ourselves from, while still passing this RegExp-based validation just fine. E.g. a file with few enough matches of the RegExp the current code has, so it passes the validation, plus just one more line that contains millions of chars but does not match that RegExp, so it triggers a DoS.
  • The number of matching lines we count is compared to a variable called max_expected_filelength, which is misleading: it's not the length of the downloaded file we're comparing with that variable, so it's unclear what our effective boundary (max_expected_filelength's value) really means.

What do you think?

  • I fail to understand what good the added parentheses around (http|https):\/\/ do. Keep in mind that every pair of capturing parentheses has some performance impact.

I tried this regexp in a tool which validates JS regexps (https://github.com/gskinner/regexr/) and that's the suggestion I got. My initial regexp did not work in there.

Well, these parentheses are not necessary (and thus a little bit misleading), so let's remove them unless we have a good reason to keep them. The regexp is long and complex enough as it is, so let's please make it simpler :) Fair enough?

In passing, regarding that regexr tool, I was curious to see if it was as buggy as your report suggested (I was all "OMG what's this crappy tool that tells us to add useless stuff to our regexps!?"), so I tried it. My experience is different. The regexp without these unnecessary parentheses works just fine for me on http://regexr.com/ (the hosted service version of https://github.com/gskinner/regexr/):

^(http|https):\/\/[a-z0-9\-_]+(\.[a-z0-9\-_]+)+([a-z0-9\-_\.,@\?^=%&;:/~\+#]*[a-z0-9\-\_#@\?^=%&;/~\+])?$

matches "https://mediatemple.net/", as expected. /me reassured about that tool :)
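
If the grouping itself is ever needed, a non-capturing group (?:...) keeps the grouping without the capture cost, though for this alternation the pattern can simply be shortened (illustrative snippets, not the branch's regexp):

```javascript
// Three ways to match the scheme prefix; only the first one captures.
var capturing    = /^(http|https):\/\//;
var noncapturing = /^(?:http|https):\/\//;
var shortest     = /^https?:\/\//;

["http://a.example/", "https://a.example/"].every(function (u) {
  return capturing.test(u) && noncapturing.test(u) && shortest.test(u);
}); // true for both URLs
```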

By the way, I believe I should test the whole thing in some horrible Windows browsers too. Creating a subticket for that.

Good idea => yes, please (and Safari too, since it has quite poor ES6 support). If you created that subticket already, I'm afraid I've missed it.

#32 Updated by intrigeri over 2 years ago

  • Priority changed from Normal to High

That's now the main blocker of deployment of the JS code on our website (#8642), so please prioritize this higher than #11109 so that we can proceed with the first stage of deployment :)

#33 Updated by u over 2 years ago

intrigeri wrote:

intrigeri wrote:

u wrote:

  • "But not the string length but the number of lines in that file." <-- why? This feels a bit overly complicated to me, compared to simply checking the length of the string, but perhaps I missed the advantage this approach has?

Earlier today you said to me 'rephrased: check that the length you get matches what you can see with "wc mirrors.json"'. Then that's how I did it.

I've probably added more confusion than useful information, then (after doing the same with my mistake about whether we can check the response's body length in the get function or not). Sorry about that!

Yes, I think indeed that you've added to the confusion, because your suggestion hereafter was exactly what I initially wanted to do. Maybe sometimes too much premature control is counterproductive… (insert smiley here.)

  • I fail to understand what good the added parentheses around (http|https):\/\/ do. Keep in mind that every pair of capturing parentheses has some performance impact.

I tried this regexp in a tool which validates JS regexps (https://github.com/gskinner/regexr/) and that's the suggestion I got. My initial regexp did not work in there.

Well, these parentheses are not necessary (and thus a little bit misleading), so let's remove them unless we have a good reason to keep them. The regexp is long and complex enough as it is, so let's please make it simpler :) Fair enough?

Sure.

In passing, regarding that regexr tool, I was curious to see if it was as buggy as your report suggested (I was all "OMG, what's this crappy tool that tells us to add useless stuff to our regexps!?"), so I tried it. My experience is different. The regexp without these unnecessary parentheses works just fine for me on http://regexr.com/ (the hosted service version of https://github.com/gskinner/regexr/):

[...]

matches "https://mediatemple.net/", as expected. /me reassured about that tool :)

I've tried the tool with all of our testcases from the mirror pool dispatcher test file and noticed it did not work with all of them. As said, I think we need to test all this later on in different crappy browsers, on crappy OSes too, and verify that it still works there.

By the way, I believe I should test the whole thing in some horrible Windows browsers too. Creating a subticket for that.

Good idea => yes, please (and Safari too, since it has quite poor ES6 support). If you created that subticket already, I'm afraid I've missed it.

#34 Updated by u over 2 years ago

  • Assignee changed from u to co6
  • QA Check changed from Dev Needed to Ready for QA

Hello,

I've updated the script again with the latest modifications.
One can test it here: https://451f.org/mirrors/js/mirror-dispatcher-web.html
It now uses the current mirrors.json, so that we have real test cases.

Dear co6, would you be able to verify that all of your concerns have been addressed?
Thank you.

#35 Updated by intrigeri over 2 years ago

u wrote:

I've updated the script again with the latest modifications.

Great, thank you!

#36 Updated by co6 over 2 years ago

Hi,

sorry, I am late on the review for this one. Thanks for taking the time to address the issues.
I have one question:
in the isValidURL() function, is there a legitimate reason for accepting HTTP URLs?
Otherwise, I suggest allowing HTTPS only here ;)
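To illustrate the suggestion, an HTTPS-only scheme check could look like this sketch (the function name is illustrative; the library's actual isValidURL() validates more than the scheme):

```javascript
// Sketch: accept only URLs whose scheme is exactly "https".
// The library's real isValidURL() also validates host and path,
// which is omitted here for brevity.
function isHttpsURL(url) {
  return /^https:\/\//.test(url);
}

console.log(isHttpsURL("https://mirror.example.org/")); // true
console.log(isHttpsURL("http://mirror.example.org/"));  // false
```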

#37 Updated by u over 2 years ago

Hi,

currently we also use HTTP mirrors, so that's still needed.

I agree that once we switch to HTTPS only, we will want to modify this again.

Cheers!

#38 Updated by intrigeri over 2 years ago

  • Subject changed from Have the mirror pool dispatcher script audited to Have the mirror pool dispatcher library audited

#39 Updated by intrigeri over 2 years ago

  • Blocks deleted (Feature #8642: Enable the mirror pool dispatcher on all website pages that need it)

#40 Updated by u over 2 years ago

Hi co6,

today I've modified the lib a bit again, sorry. I hope that's not a problem for you.
This time I sincerely think it shouldn't change much anymore.

Cheers!

#41 Updated by anonym over 2 years ago

  • Target version changed from Tails_2.4 to Tails_2.5

#42 Updated by u over 2 years ago

Hi co6!

I hope you are well. I just wanted to know: are you still up for this last round of reviewing?

Cheers!
u.

#43 Updated by intrigeri about 2 years ago

  • Target version changed from Tails_2.5 to Tails_2.6

#44 Updated by anonym about 2 years ago

  • Target version changed from Tails_2.6 to Tails_2.7

#45 Updated by co6 almost 2 years ago

Hi,

sorry for the delay. I finally looked at the modifications; everything seems good.

#46 Updated by intrigeri almost 2 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (co6)
  • QA Check changed from Ready for QA to Pass

Yeah! Thank you. I've bumped the submodule on our master branch to point to the latest version of the library (which only pulled commit 640e250df14d11bf1b158f37b4dda6a49363c428) => closing.
