How to prevent image file contents from appearing in text search results?
When I search in the body of the text for a short word, such as 'word', the search results display numerous messages that do not contain the word 'word'. I noticed that in _all_ of these cases there was an image displayed/attached. It occurred to me that "hits" were possibly being registered because the sequence 'word' happened to show up at random in the ASCII (or whatever) encoding of the image.
If I am right, how do I fix it? If I am wrong, then why would I get all those false positives in the search results?
Using Thunderbird version 45.0
Note: I never tried a search until now, so the problem may occur in earlier versions.
Edeziri
All Replies (6)
You have jumped straight through to a conclusion which has set how you frame your question. I think there might be a simpler explanation.
Given that most of the world's email is generated by Windows users, and many of them use Outlook, which embeds references to Microsoft Word, is it possible that the search is finding the literal word "word" in those messages?
By your logic, almost any four-letter word might legitimately appear in encoded image data. Do you get the same hits for other search terms?
To test your theory (and mine), you could view and save the complete source of one of the offending messages (ctrl+u) and do a search on that using your preferred search tool. (I'd use Notepad++.)
My propensity for jumping to conclusions is nothing short of scandalous, as you astutely point out. "Word" was a stupid choice: although neither I nor my correspondents use Outlook, they might have at some point. Still, one would think that the appearance of "word" would be confined to the headers, whereas I was searching in the body of the text, as indicated above.
In any case, to further test my admittedly harebrained theory, I tried the four-letter word "twit". Although that is not a word I or my correspondents typically use, it showed up several times, but only in emails with attached images. (Not necessarily just of people, or I would have had to revise my theory.) Since one of us might conceivably have mentioned "twitter", I checked that as well, but there were no hits at all. It has three more letters, but never mind. I then tried a word, or rather a non-word, which I literally never use, nor does anyone I know, and that is "jklm". There were plenty of hits for that, but they all involved emails with images attached. Not necessarily the same ones, but you get the idea.
Edeziri
Whatever the word, I can't replicate your problem. The only hits I'm getting are legitimate message body text finds, including "jklm", which only appears in your last posting. Thunderbird does have an unfortunate way of searching headers and HTML markup, which results in people reporting other invisible words such as "panose" and "calibri", almost always put there by Outlook/Office.
I'd be interested in the outcome of saving the message's source and searching that file for one of these ghost words.
Press ctrl+u to see the source, ctrl+a to select it all, and ctrl+c to copy it. Paste it (ctrl+v) into a new text document in Notepad, WordPad or your word processor, and search it for the offending word. That way we'll know if it really is being discovered inside an encoded image.
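If you would rather not hunt through the source by eye, the same check can be sketched in a few lines of Python. This is only an illustration, not anything Thunderbird itself provides: it assumes you have saved the message source to a file, the function name `find_term_by_part` is made up for this example, and it matches case-insensitively on the assumption that the search in question does too. It reports, for each MIME part, whether the term sits in readable text or in the encoded data of an attachment.

```python
from email import message_from_string


def find_term_by_part(raw_source: str, term: str) -> list[tuple[str, str]]:
    """Return (content_type, location) for every MIME part containing term.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    msg = message_from_string(raw_source)
    needle = term.lower()
    hits = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        ctype = part.get_content_type()
        if ctype.startswith("text/"):
            # Decode the transfer encoding, then the character set.
            decoded = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                text = decoded.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                text = decoded.decode("utf-8", errors="replace")
            if needle in text.lower():
                hits.append((ctype, "readable text"))
        else:
            # Look at the payload exactly as it appears in the source,
            # i.e. the base64 characters of an image or other attachment.
            encoded = str(part.get_payload())
            if needle in encoded.lower():
                hits.append((ctype, "encoded attachment data"))
    return hits


# Example use, assuming the source was saved as message.eml:
# with open("message.eml", encoding="utf-8", errors="replace") as f:
#     print(find_term_by_part(f.read(), "jklm"))
```

A result that lists only image parts under "encoded attachment data" would support the theory. Note that the sketch does not join base64 lines, so a term split across a line break in the source is not counted.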
That would be interesting and valuable software that can identify twits just from their pictures. ;-)
Your conclusion holds. I ran the same kind of search (body contains jfkl) and got one hit. When I searched the source of that mail, the only occurrence was within the embedded image.
So your question stands: how to exclude embedded attachments from being searched.
I ran various experiments.
1. I have two accounts set up in Thunderbird. I sent an email from one account to the other with an image as an attachment.
2. I repeated the above except that the image (same one) was embedded in the body of the email (copied and pasted).
I then ran a search of the sent mail in the first account, and the inbox in the second account, for a random three-letter text string. There was a hit, but only in the email with the _embedded_ image. I opened up the source so that I could view the alphanumeric code for the image, and not surprisingly saw dozens of instances of the text string. I then did the same for the email with the image _attached_, and again saw dozens of instances. For whatever reason, however, the search only showed hits in the email with the embedded image.
However, when I tried sending emails with the same image from non-Thunderbird interfaces (Gmail, and a mail client [Eudora]) to one of the Thunderbird accounts, I got no hits. But I suspect certain other email interfaces such as Apple Mail - which I don't have and can't experiment with - can result in hits when received in my Thunderbird accounts.
There are, of course, some very simple workarounds for the original problem, if you have it at all. Very long words will probably not result in false hits, because the odds against a long string appearing by chance in the encoded image data are correspondingly long. For searches of short words, add a second rule where the body of the message does NOT contain the text string zq. The latter would virtually never appear in normal text, but would be present in the encoded data of virtually any image.
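The odds behind both workarounds can be put into rough numbers. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a description of how Thunderbird searches: it assumes the image is base64-encoded, that the encoded characters behave like uniform random picks from the 64-character alphabet, that the search term consists of letters only and is matched case-insensitively, and it ignores line breaks in the encoded data. The function names are made up for this example.

```python
import base64
import os
import re


def expected_hits(image_bytes: int, term_length: int,
                  case_insensitive: bool = True) -> float:
    """Estimate how often a letters-only term of the given length appears
    by chance in the base64 text of an image of the given size."""
    # Base64 emits 4 characters for every 3 bytes (rounded up, with padding).
    b64_chars = 4 * ((image_bytes + 2) // 3)
    # Each letter matches 1 of 64 characters, or 2 of 64 if case is ignored.
    per_position = ((2 if case_insensitive else 1) / 64) ** term_length
    positions = max(b64_chars - term_length + 1, 0)
    return positions * per_position


def count_in_random_image(image_bytes: int, term: str) -> int:
    """Empirical check: count case-insensitive matches of term in the
    base64 encoding of random bytes standing in for image data."""
    encoded = base64.b64encode(os.urandom(image_bytes)).decode("ascii")
    return len(re.findall(re.escape(term), encoded, flags=re.IGNORECASE))


if __name__ == "__main__":
    for length in (2, 4, 10):
        print(length, expected_hits(100_000, length))
    print("zq found", count_in_random_image(100_000, "zq"), "times")
```

Under these assumptions a 100 KB image is expected to contain a given two-letter string such as zq roughly 130 times, a given four-letter string only about once in every eight such images, and a ten-letter string essentially never. That is consistent with short words producing false hits across a mailbox full of images while long words do not.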