How can I get Thunderbird 38.5.1 to filter msgs with any UTF-8 in any headers?

  • 9 replies
  • 4 have this problem
  • 1 view
  • Last reply by Gnospen

No excuses for need. I simply receive no valid email from anyone that uses UTF-8 characters or the "utf-8" string in any of the headers, and I don't send any either. The only messages I get with that string are spam trying to send non-Indo-European text, or text in a language I don't use. Is there any way to filter this crud without waiting forever for the junk filter to catch on? Please help me weed out the bulk of this spam; I can't realistically block it by creating a separate filter for every multi-character combination a random sequence generator can produce.

All Replies (9)

How much have you looked into what can be done with the filters? I thought there was a rule for anything appearing in the header. Or you can create a custom header filter field, which is usually more efficient than searching the whole header.

You can add a custom header field to use in your filters. Try Content-Type, thus:

Content-Type|contains|utf-8

This appears to be useful. However, I find that it doesn't catch every use of UTF-8 encoding: many of my messages are multipart, and the content type is declared separately for each MIME part. Those declarations aren't technically in the message header, so they aren't seen as "Content-Type" header fields.
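A minimal sketch of why a header-only test can miss these messages, assuming the raw message source is available as a string. The helper names `topLevelHeaders` and `declaresUtf8` are hypothetical and not part of Thunderbird or any add-on, and the sample message is made up for illustration.

```javascript
// Hypothetical helpers; not part of Thunderbird or FiltaQuilla.

// The top-level header block ends at the first empty line of the raw source.
function topLevelHeaders(rawMessage) {
  const end = rawMessage.search(/\r?\n\r?\n/);
  return end === -1 ? rawMessage : rawMessage.slice(0, end);
}

// Report where a charset=utf-8 declaration appears.
function declaresUtf8(rawMessage) {
  const charset = /charset\s*=\s*"?utf-8"?/i;
  const headers = topLevelHeaders(rawMessage);
  return {
    inHeaders: charset.test(headers),
    inBodyParts: charset.test(rawMessage.slice(headers.length)),
  };
}

// Made-up multipart sample: the only charset label sits inside a MIME part.
const sample = [
  "From: someone@example.com",
  "Subject: test",
  'Content-Type: multipart/alternative; boundary="b1"',
  "",
  "--b1",
  "Content-Type: text/plain; charset=utf-8",
  "",
  "hello",
  "--b1--",
].join("\r\n");

console.log(declaresUtf8(sample)); // { inHeaders: false, inBodyParts: true }
```

Here the top-level Content-Type only says multipart/alternative, so a test on the header field alone never sees "utf-8".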

So you may need to include a body search for "utf-8". I'm testing it out using BodyRe as supplied by FiltaQuilla (or maybe the Expression Search/ Gmail UI add-on), so the search is

BodyRe|matches|/utf-8/i

You might be able to make this more specific so it isn't triggered by innocent mention of utf-8 in the message text.
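One possible way to narrow it, shown as plain JavaScript: match "utf-8" only when it is the value of a charset parameter. The `narrow` pattern is an assumption and has not been tested against FiltaQuilla; whether BodyRe sees the per-part MIME labels at all is also not confirmed here.

```javascript
// Broad pattern from the post above: fires on any mention of "utf-8".
const broad = /utf-8/i;

// Narrower variant (an assumption, untested in FiltaQuilla):
// only fires when "utf-8" is the value of a charset parameter.
const narrow = /charset\s*=\s*["']?utf-8/i;

// Made-up sample strings for illustration.
const mimeLabel = 'Content-Type: text/html; charset="UTF-8"';
const innocentText = "Please save the file as utf-8 before sending it.";

console.log(broad.test(innocentText));  // true
console.log(narrow.test(innocentText)); // false
console.log(narrow.test(mimeLabel));    // true
```

In the notation used above, that would presumably be entered as BodyRe|matches|/charset\s*=\s*["']?utf-8/i.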

I don't see here in my email a useful correlation between utf-8 and spam. Much of my spam has no mention whatsoever of utf-8, and much of my legitimate email does use utf-8. So for me, and many people in general, I suspect this isn't a useful differentiator. Good luck.

The image I just added shows a section of the source of the type of spam I was complaining of. When the To and From fields are the same, and the standard filter won't recognize anything such as a Samaritan Vowel Sign Long AA in UTF-8 pasted directly in the filter editor for either regular or custom fields, or in the body of the message text, you may rest fairly assured that any means of detecting multi-byte trash used to send me odd fonts or patterns that slip past the junk detector would be of great value to me. Searching successfully for any UTF characters with a standard filter type would most likely allow me to dampen the volume of loud junk spraying into my inbox.

I can write exception filters that move email from specific senders who still need to write to me in Spanish or Latin (I passed required courses in both a few years back) into a custom folder, and run those before my final, no-exit "Death to Spam" filter, so that thinning out my inbox needs no human intervention from me.

Thanks for the earlier suggestions though.

Have a look at FiltaQuilla. It has a JavaScript filter that extends filtering to anything you can describe in JavaScript.

http://mesquilla.com/extensions/filtaquilla/

https://addons.mozilla.org/en-US/thunderbird/addon/filtaquilla/

For those of us who have no idea where to start with JavaScript, you may also find that the regular expression capability introduced by the two add-ons I mentioned before can detect these otherwise invisible characters. I haven't been able to find it in the source code, but Thunderbird's native filters apparently ignore certain ASCII characters (@ and + are known to be problematic), and these issues can be worked around using regular expressions.

Furthermore, the use of Unicode in your sample is implied, not betrayed by an explicit Content-Type header. The add-ons also offer regular expression matching on the Subject, From and To fields, and it's easy to detect ranges of characters in a regular expression. It remains to be seen whether the regular expression engine can work with wide characters.
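As a sketch of what a character-range test looks like in plain JavaScript (not verified inside either add-on), the pattern below flags any code point above U+024F, the end of the Latin Extended-B block. The cut-off, the name `hasNonLatinCharacter` and the sample strings are illustrative assumptions.

```javascript
// Matches any code point above U+024F (the end of Latin Extended-B).
// The cut-off is an illustrative assumption; the "u" flag makes the
// engine treat the pattern and the input as full Unicode code points.
const beyondLatin = /[\u{250}-\u{10FFFF}]/u;

// Hypothetical helper name.
function hasNonLatinCharacter(text) {
  return beyondLatin.test(text);
}

console.log(hasNonLatinCharacter("Invoice for March"));  // false
console.log(hasNonLatinCharacter("Reunión en el café")); // false
console.log(hasNonLatinCharacter("特別なお知らせ"));      // true
```

A range this blunt also fires on typographic quotes and the euro sign, which sit above U+024F, so it would need tuning before being used to delete mail.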

But to identify identical To and From fields, I think you will need to write code.
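A rough sketch of the comparison itself, in plain JavaScript. The helper names are hypothetical, and how the From and To values would be obtained inside FiltaQuilla's JavaScript filter is not shown, since that interface isn't described in this thread.

```javascript
// Hypothetical helper: pull the bare address out of a header value such as
// 'Jane Doe <jane@example.com>' or 'jane@example.com'.
function extractAddress(headerValue) {
  const angle = headerValue.match(/<([^<>\s]+@[^<>\s]+)>/);
  const bare = angle ? angle[1] : headerValue.trim();
  // Simplification: compare case-insensitively.
  return bare.toLowerCase();
}

// True when both headers carry the same single address.
function sameSenderAndRecipient(fromHeader, toHeader) {
  return extractAddress(fromHeader) === extractAddress(toHeader);
}

console.log(sameSenderAndRecipient("Me <me@example.com>", "ME@example.com"));     // true
console.log(sameSenderAndRecipient("Shop <news@example.com>", "me@example.com")); // false
```

Legitimate mail can also have identical To and From fields, for example a message someone sends to themselves with the real recipients in Bcc, so this test is probably best combined with others rather than used alone.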

Modified by Zenos

To Matt, thanks for the note on the add-on. I installed it, and it seemed at least a bit better at matching some individual non-ASCII characters when I cut and pasted an example from the message source. Unfortunately, so many characters fall within UTF's range that even a single-byte numeric range of forbidden values wouldn't give a yes/no hit on all of them, and the alternative is putting each character into a filter as a separate "any"-style Boolean OR.

To Zenos, thanks also for the comment on the add on, and for your replies with suggestions. Unfortunately for all others, while I still continue to exist, I am rapidly becoming an old timer type.

My days of writing code involved starting with stuff such as BASIC, or 65xx machine language with statements such as A9 01, meaning load the accumulator with the literal value 1, aka LDA #$01. I only later moved up to leaving BASIC programs running in Rat Shack COCOs that would play a tone sequence one hour after I had typed in the code and cranked the volume on the TV up all the way so everyone in the mall could enjoy the surprise burst. I also wrote my own BBS stuff back in the days of dial-up modems, and was the coordinator of my local FidoNet group, helping distribute mail late at night in an early form of multi-system communication. We also shared a file or two over the systems back in those days.

I would later earn a living writing stuff for DOS, and even some forms of Windoze using ** gasp ** Pascal, and then, even later, Delphi.

Unfortunately, I never graduated up to C++, so JavaScript-style regular expressions for matching stuff that doesn't fit within a character array, a specified character range, or even a numeric representation of those ranges are beyond me. Getting FiltaQuilla fed with the appropriate string representing one of those expressions isn't something I've done just yet. ** O^2TD **

Despite that, I would love to be able to use a fairly simple assemblage of characters to represent everything not too Indo-European. Then I wouldn't have to figure out whether an email with a subject along the lines of "㨮 㨸 䏟 䑈." is possibly something I don't want to delete by mistake (such as from one of my credit card accounts that I DO conduct business with, which has multiple sending addresses and might not be in my whitelist), or some dirtbag sending me excrement that would serve better to fasten the base of my trashcan.

Oh well. Whine. Moan.

To all partaking of this with intent to assist, "Please take care."

Can you please forward to me (as an attachment) one of these messages with the odd characters in the subject?

xenos at gmx dot co dot uk

I may have a solution for these, but I don't get any to test with. I can send myself some, but then I'm making assumptions. It would be good to see a real example.

I've been looking at filtering for UTF-8 with regular expressions, and it's causing a lot of people a headache. It looks simple, but I haven't cracked it yet. But it made me think of an alternative way of going about what you want. We can set up a filter that tests the subject line for any non-ASCII character. If we set it appropriately, it will tolerate the most commonly encountered accented characters too.

So,

Subject RegEx|matches|/[^\x20-\xac] /

will fire on "funny" characters in the subject line, regardless of any character encodings explicitly declared, or not, in the message or its headers. Use this as a trigger to mark the message as Junk.
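To see what the pattern does as written, here is a small check in plain JavaScript rather than inside the add-on. Two details stand out: the space before the closing slash is part of the pattern, and the range stops at \xac, below accented letters such as é (\xe9). The `variant` pattern is only a guess at the intent, and the test strings are made up apart from the sample subject quoted earlier in the thread.

```javascript
// The pattern exactly as posted: one character outside \x20-\xac,
// followed by a literal space.
const posted = /[^\x20-\xac] /;

// Variant (an assumption about the intent): no trailing space, and the
// range extended to \xff so Latin-1 accented letters are tolerated.
const variant = /[^\x20-\xff]/;

console.log(posted.test("㨮 㨸 䏟 䑈."));  // true: a character is followed by a space
console.log(posted.test("特価"));         // false: no space follows either character
console.log(posted.test("café menu"));    // true: é is \xe9, above \xac
console.log(variant.test("特価"));        // true
console.log(variant.test("café menu"));   // false
```

Both patterns also treat tabs and other control characters below \x20 as "funny" characters.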

The problem for me is that utf-8 characters in the subject line appear to be URL encoded, or similar, and so it seems that this stops the filter seeing them. But it is also blind to an explicit mention of "utf-8" used to alert the system to the presence of the URL encoding. So it's possible that the URL encoding is deciphered before the filter gets it, but I can't see what is then passed to the filter.
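What looks like URL encoding in the raw Subject line is most likely the MIME "encoded-word" form defined in RFC 2047, written as =?charset?B?...?= (Base64) or =?charset?Q?...?= (a quoted-printable variant). The sketch below decodes the UTF-8 case to show how the same subject looks before and after decoding. The helper name is hypothetical, it assumes a Node.js runtime for `Buffer`, and which of the two forms Thunderbird hands to a filter is not established in this thread.

```javascript
// Hypothetical helper, assuming a Node.js runtime (uses Buffer).
// Decodes UTF-8 encoded-words: =?utf-8?B?...?= or =?utf-8?Q?...?=
// Simplification: whitespace between adjacent encoded-words is left alone.
function decodeUtf8EncodedWords(headerValue) {
  const encodedWord = /=\?utf-8\?([bq])\?([^?]*)\?=/gi;
  return headerValue.replace(encodedWord, (whole, kind, payload) => {
    if (kind.toLowerCase() === "b") {
      return Buffer.from(payload, "base64").toString("utf8");
    }
    // Q encoding: "_" stands for a space, "=XX" for one byte in hex.
    const bytes = [];
    for (let i = 0; i < payload.length; i++) {
      const hex = payload.slice(i + 1, i + 3);
      if (payload[i] === "_") {
        bytes.push(0x20);
      } else if (payload[i] === "=" && /^[0-9a-f]{2}$/i.test(hex)) {
        bytes.push(parseInt(hex, 16));
        i += 2;
      } else {
        bytes.push(payload.charCodeAt(i));
      }
    }
    return Buffer.from(bytes).toString("utf8");
  });
}

// Made-up sample subject in Base64 form.
const rawSubject = "=?UTF-8?B?44GT44KT44Gr44Gh44Gv?=";
const decoded = decodeUtf8EncodedWords(rawSubject);

const nonAscii = /[^\x20-\x7e]/;
console.log(nonAscii.test(rawSubject)); // false: the raw form is pure ASCII
console.log(nonAscii.test(decoded));    // true: the decoded text is not
```

If a filter is given the raw form, a test for the literal prefix, such as /=\?utf-8\?[bq]\?/i, may be the more reliable trigger; if it is given the decoded form, a character-range test is the one that can fire.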

Modified by Zenos

This was a very educational question. Thank you all. But has anyone thought through the idea of testing for equal To and From fields?

Whenever you send to a list of anywhere from 2 to 60 or more recipients, the common way is to put your own address in the To field and the list addresses in the Bcc or Cc field. That way you also get a copy yourself. Such a message could be an invitation from a friend or a subscribed news message, not necessarily spam.

(I started with a home-built PC with a Z80 processor, using 8-bit assembler.)