-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make http response example more robust #460
base: main
Are you sure you want to change the base?
Make http response example more robust #460
Conversation
@caitmuenster Here is the PR that I was curious about; you can see the question I posted on Discourse as well if you're curious. Thanks! (wingman-jr-addon and jessetrana are both me here, zephyr on Discourse) |
Thanks @wingman-jr-addon! Adding this to @Rob--W's queue for review. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an example we need to strike a balance between completeness of code and simplicity of code. Currently it is not simple, nor complete - see my comments below.
http-response/background.js
Outdated
// taken on by the browser. Here, a simplified approach is taken | ||
// and the complexity is hidden in a helper method. | ||
let decoder, encoder; | ||
[decoder, encoder] = detectCharsetAndSetupDecoderEncoder(details); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is complicated but still not a full solution to the problem.
For simplicity, I recommend to check whether the response is (X)HTML (to ignore responses such as images, media, downloads), and only proceed if needed. Then the MIME/charset detection logic can greatly be simplified (see also comment below about charset detection).
You should also check that the HTTP status code is OK before attaching a stream filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The note about the complex logic is discussed in the other relevant comments below, but duly noted.
I agree it would be better to move the logic for detecting the supported MIME types as a pre-check here.
- Pre-check MIME types
Also agreed on the HTTP status code check.
- Add HTTP status code check
http-response/background.js
Outdated
// Just change any instance of Example or Test in the HTTP response | ||
// to WebExtension Example or WebExtension Test. | ||
let mutatedStr = fullStr.replace(/Example/g, 'WebExtension Example'); | ||
mutatedStr = mutatedStr.replace(/Test/g, 'WebExtension Test'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the second one. It is not necessary for the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is necessary for the example. I added it because the W3C test suite pages do not have the word "Example" but do have "Test" so it shows that the plugin is mutating the pages when viewing the test suite - I think that's a critical aspect to preserve. However, in the name of simplicity, I could remove the "Example"/example.com from the addon; I was trying to keep a nod to the original behavior but perhaps it just feels kludgy now.
- Remove extraneous Example/example.com
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you'd like to replace both, you could do /Example|Test/
to demonstrate how one can replace multiple words (and if you wish to prepend "WebExtension", use 'WebExtension $&'
. If you don't know about $&
, see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#Specifying_a_string_as_a_parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I think your first comment made me realize that I had a conflict of "make everyone happy" or "have a better focused example" and so I'm choosing the latter and keeping it simple. As venerable as example.com is, it gets enough press elsewhere.
http-response/background.js
Outdated
filter.disconnect(); | ||
filter.onerror = e => { | ||
try { | ||
filter.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to explicitly handle errors here?
filter.onerror
could occur when the response is redirected (when you've failed to check for status codes, as you did before),
and also repeatedly when an out-of-memory issue occurs - https://bugzilla.mozilla.org/show_bug.cgi?id=1404381
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question - this part of the error handling isn't particularly relevant to the heart of the example either.
- Remove explicit error handling
http-response/background.js
Outdated
// character - the stream parameter is critical for success as this | ||
// method gets called multiple times. | ||
let str = decoder.decode(e.data, {stream: true}); | ||
fullStr += str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffering everything before flushing is simple, but not user friendly.
Instead of buffering everything at once, you can consider flushing all but some penultimate bytes/characters. After all, if the string to replace is at most 7 characters, then you can safely flush everything except the last 6 characters, to handle this case:
[prefix this can be flushed]Exampl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. This is an artifact of the code I was porting this from (where I needed to do a regex search of unknown length) and can be improved. I think the best solution for this would be a streaming regex transformer - I don't suppose something along those lines exists without external library support? That would make my day. Otherwise, I will change it as requested.
- Streaming replacement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need an external library for this, the logic can be written in a concise way. See the brief explanation below for the algorithm. It is definitely more involved than buffering all and flushing at once though. If the example becomes very unwieldy if this were to be implemented, then buffering may not be that bad of an idea.
Instead of using replace
with a regexp and a string, you can use the .exec
method of a regular expression with the g
flag to find matches. Flush the prefix, append the replacement, and continue (if you are cutting the string, reset lastIndex
to 0
to resume from the start, if you aren't cutting the string you can just call .exec
again and it will continue searching from position lastIndex
in the string).
When you don't find a match, and know that the regular expression has a maximum number of matches (such as 8
in case of /Example|Test/g
), then cut the remaining string in two parts: Keep the last 7 characters (=max max length minus one) for the next time and flush the rest (this is what I explained in my previous comment).
Note: when your regular expression does not have a unique match (e.g. /A{1,3}/g
or [0-9]+
), then you need extra logic: to make sure that as much is matched as possible, when the regular expression matches the end of the string collected so far, you should not flush and keep the string for the next iteration, in case the regexp is going to match more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted to go for the simpler static string streaming replacement, and added comments around its limits.
http-response/background.js
Outdated
// 1) Detects the charset for the TextDecoder so that bytes are properly turned into strings | ||
// 2) Ensures the output Content-Type is UTF-8 because that is what TextEncoder supports | ||
// 3) Returns the decoder/encoder pair | ||
function detectCharsetAndSetupDecoderEncoder(details) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"detect and set up decoder/encoder" suggests that it just does detection and creating new instances of the decoder/encoder.
The method however also mutates the response headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't like it either. I believe that the real issue here is that TextEncoder does not support anything other than UTF-8, so we must force the content type on the output, which is awkward. As noted in the PR description, the code would benefit from having TextEncoder support more types to avoid this hack. Can you think of a way to preserve the original content encoding without the TextEncoder polyfill? I'd rather bypass the mutation altogether because it seems a bit treacherous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For search keywords that are compatible with the used encodings (e.g. letters from the latin alphabet have the same bytes in utf-8, iso-8859-2, ..), you don't need to worry about using TextEncoder. For fixed-length search terms, you could convert the string to bytes and search and replace on typed arrays instead of strings. This may be even more performant than converting to strings and back.
But for the example, I think that it is fine to convert to UTF-8. First, it shows the significance of character encodings.
For those who are not that well versed with character encodings, it is a safe default to choose, because utf-8 is very common.
} | ||
|
||
// Detect the charset from Content-Type | ||
function detectCharset(contentType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The charset detection logic as implemented below is incorrect (e.g. when multiple charset
parameters are set, the last one should be used; when multiple MIME types are set, the previously set charset should be forgotten). Let's keep the example simple, however, and make it more obvious that the implementation is incomplete.
Here is an example of a more complete implementation that also covers some edge cases: Rob--W/open-in-browser@a6b926e
As you can see, it's quite involved - do not put the code in the PR. You can link to it if you wish though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's wisdom here; thanks for the link.
I've been thinking about your comment about an obviously incomplete implementation. While the idea of a complete solution here is appealing, the simplified solution will likely cover 99% of the cases and the comment will make the careful reader aware of the problem. Lack of awareness of this problem - albeit obvious in retrospect - is why I'm wanting to contribute back a more robust example in the first place; a user reported a bug to me in a review of my addon. I'd like others to avoid the same fate, as their users may not be as considerate as mine was.
- Make incompleteness obvious
Given both the complexity of determining the charset as well as the need for it to be accurate in order to filter text properly, do you think the browser-predicted charset would ever be a candidate to expose on the details
object? It just seems like right now it would be incredibly difficult to create a truly correct implementation - do you know of any extant addons that are actually correct in this regard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given both the complexity of determining the charset as well as the need for it to be accurate in order to filter text properly, do you think the browser-predicted charset would ever be a candidate to expose on the
details
object?
What a good idea. I have thought of it before and posted a feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479
Unfortunately the feature request was denied. If you post a new bug that shows use cases and why alternatives aren't a good fit, the feature request can be re-evaluated.
It just seems like right now it would be incredibly difficult to create a truly correct implementation - do you know of any extant addons that are actually correct in this regard?
The Open in Browser add-on that I linked before was authored by me, and modelled after Firefox's implementation (which includes non-standard behavior).
http-response/background.js
Outdated
// Historically, detecting character encoding has been a tricky task | ||
// taken on by the browser. Here, a simplified approach is taken | ||
// and the complexity is hidden in a helper method. | ||
let decoder, encoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the TextDecoder
and TextEncoder
constructors here as before to improve readability. You can declare a charset
variable if you want to use something else instead of UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, except that I need to potentially modify the Content-Type output to be UTF-8 if the input type is not already; that would drive having something like detectCharset
and forceCharset
as separate calls. Would you find that organization to be preferable to the problematic detectCharsetAndSetupDecoderEncoder
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the simplest case where the Content-Type needs to be an exact match for something that looks like `text/html; charset=utf-8", then you don't need to change the response header. But even if you do change the header, then you can still improve the readability by keeping the main logic in the listener, and using small helper functions to avoid too large function bodies. For example:
function listener(details) {
let {responseHeaders} = details;
let ctHeader = responseHeaders.find(h => h.name.toLowerCase() == "content-type");
// TODO: Verify that ctHeader is supported.
// If Content-Type header not set, the browser is going to do content-sniffing,
// and we should also return to avoid trouble (e.g. breaking downloads, PDFs, videos, ...).
if (content type is not supported) {
return;
}
let charset = ...
let decoder = new TextDecoder(charset);
let encoder = new TextEncoder();
etc...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another run at the code. I was able to clean it up in a few places as per your comments - keep the header object reference rather than index, return earlier/don't guess, etc. Still, a full half of the code is dedicated to character encoding alone; I can't say that the code is elegant.
Thanks @Rob--W for taking the time to look at this and provide helpful feedback, and thanks @caitmuenster for getting this going. I haven't gotten much feedback from the community on this aspect of using WebRequest and yet I think it is so vital to any addon trying to filter text. I'll take a look at your comments; I think you might have some better ideas on how to solve some of the problems I'm facing but it may take a round or two of feedback until I get into the right spot. Thanks! |
@Rob--W I've taken another pass at the code and have closed out the open tasks I set up on the review. Ultimately, I still dislike the code, but I think it's an improvement and thanks for your help. Let me know if it passes muster. I'd also like to promote to the top level the conversation we had going in the sub-thread about e.g. exposing the detected MIME type. We were saying:
Also, the link to your original GitHub issue in there has quite valuable implementation details so I'd like link to it as well: https://github.com/Rob--W/open-in-browser/issues/5 I'd like to discuss this issue more. I think - assuming there is enough use of As I see it, there are many possible depths of solutions that could work to varying degrees:
So I'd probably lean towards something more in the vein of option 2 or option 3; I think option 3 has the potential to be more problematic from an implementation standpoint. After pondering the problem for another 3 years, what thoughts have you had on the matter? |
Also, I did a little research tonight. I searched GitHub for "TextEncoder" and "filterResponseData" and surveyed many of the results. Not exactly the most scientific - and I likely have errors in my hasty analysis - but here are my rough findings:
Those are rather abysmal results, but not surprising. This indicates to me that there is a demonstrated need for awareness as well as mechanisms to make some level of correctness easier - character encoding seems like something that should be a minor footnote for the average addon developer. |
@Rob--W I know this will be testing the memory a little but, based on your comment, are you happy about the example? If you are, I will test and make sure it works as expected and then merge the changes. |
@rebloor @Rob--W I saw that activity popped up on this and thought I'd chime in a bit as I've gotten more experience with the issue as my addon has matured. Unfortunately it hasn't translated to the work here (yet?) as I thought this effort was dead. If there's interest I can take another crack at freshening this up. In short, I now think that probably the best way to handle the issue is to provide a special linked pair of TextDecoder/TextEncoder that hide the detection altogether and allow the developer to just feed in the stream, get a string, and reencode the string. You still need the raw bytes in many cases but additionally for text handling a built-in like this would help greatly. The big reason for this approach is primarily because I've now seen many of the different methods that can in theory show up actually do get reported as bugs - including more difficult ones like the (One further note/update: not having TextEncoder support encoding into all the encodings TextDecoder can support is a frustrating limitation as I can't simply sniff the charset, decode, and then encode cleanly - this is part of what would be nice with the TextEncoder pair being linked.) I'd also be curious to hear how @Rob--W 's thoughts have evolved now over the past 6 years since his original feature request. |
Oops, this fell through the cracks. I'll perform a review once this is freshed up. (The rest of the text below is part of the ongoing discussion about a feature request for Firefox; you can stop reading if you are only interested in the PR)
This sounds very convenient from the extension developer's point of view, but a huge burden to the extension framework. Due to content sniffing, the target encoding may sometimes be dependent on the data itself (e.g. binary vs something else). And sometimes there is no lossless way to encode one in another.
I suppose that you are primarily interested in replacing specific parts of the content without changing the encoding. When the byte sequence is unambiguous (the keyword and replacement), you could also operate on the bytes directly, without using TextEncoder/TextDecoder.
Non-utf8 has little utility in web browsers, and having encoders for non-utf8 encodings would bloat the implementation and result in a larger application binary, as well as a larger attack surface. So while I understand the convenience to developers again, I also recognize the issue with having this functionality natively. Note - if you really want to there is of course nothing stopping you from having encoders the other way around again, but that's probably not optimal.
On https://bugzilla.mozilla.org/show_bug.cgi?id=1425479 ? When I filed that issue I was an external contributor with experience as an extension developer, and browser engineering in Firefox and Chromium (as a volunteer). Now I have been working for a few years as a browser engineer at Mozilla, and one thing that stands out is that there are significantly more ideas than time available to implement them all. Therefore even if an idea on its own seems useful, it may not make the cut due to prioritization. Especially if the feature is niche but requires considerable engineering effort and maintenance. |
@wingman-jr-addon please ping Rob when you've had the opportunity to address his feedback and make any other changes. |
a0377db
to
6a5b6da
Compare
I've been using WebRequest filtering for quite some time now in one of my plugins, and while it's easy to get started with toy examples, I've found that there is some significant complexity that must be addressed when trying to do any sort of text manipulation in a bit more robust manner. (Some of these deficiencies are discussed in #364 , partially addressed in the MDN docs ) I thought extending this extension would be the right place to start as I would have wished to know these details when I started my plugin.
This extension now demonstrates three important things:
.ondata
Content-Type
.As an observation, browser support for two things would clean this up tremendously:
webRequest
details (at least insofar as it is detectable based on headers) would avoid manual, poor detection logic like that I have here.TextEncoder
so that the original character encoding can be preserved rather than always converting to UTF-8.At any rate, let me know what you think! I hope this will help provide a bit stronger starting point for others using
webRequest
.