mimetype guessing is fooled by extension

Luciano Montanaro mikelima at cirulla.net
Sun Jul 25 13:28:16 BST 2004


On Sunday 25 July 2004 13:39, Allan Sandfeld Jensen wrote:
> On Sunday 25 July 2004 13:14, Allan Sandfeld Jensen wrote:
> > On Wednesday 21 July 2004 16:25, Luciano Montanaro wrote:
> > > I created a very big file to test the file plugins (I noticed there
> > > were problems earlier this year), and I have found that at least the
> > > C++ and diff file plugins get trapped in a tight loop by it. I think
> > > plugins of this kind should bail out on files of unreasonable length.
> > > However, another issue is that the file was wrongly identified as a C++
> > > file, while it does not even qualify as a text file (I don't think '\0'
> > > is a valid character in a text file).
> > >
> > > "file prova.cpp" correctly says the file is a "data" file.
> > > Can't the mime identification be made smarter, using the file extension
> > > as an additional hint instead of the only way to identify the file?
> >
> > Yes, by setting X-KDE-PatternAccuracy to <100.
> > Notice that if you open the properties for the file, it will detect the
> > content-mimetype more accurately.
> >
> > I will take a look at the issue.
>
> Oops. One major problem. The magic (content) detection code can correctly
> detect diff, C++ and C files. Diff will work fine by setting
> X-KDE-PatternAccuracy as suggested above, but C and C++ are detected as
> "text/x-c" and "text/x-c++", which do not exist as mimetypes in KDE (it has
> "text/x-csrc" and "text/x-chdr"). What is worse is that the magic code
> _cannot_ detect the difference between headers and source, so we end up in
> a situation where a combination of patterns and magic is needed to do proper
> detection. There is currently no way to do that.
>

Well, the only way to distinguish a C/C++ include file from a regular C/C++
source file is by its extension... C++ now encourages the use of
extension-less includes, and Qt 4 seems to go in that direction.
  

> A partial fix would be to add "text/x-c" and "text/x-c++" as valid
> mimetypes and let the "text/x-chdr"-type of mimetypes inherit from them. 

That would be useful.
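
To make the inheritance idea concrete, here is a rough Python sketch (not KDE
code). The parent table and the helper name is_a are made up for illustration,
and I am assuming that "text/x-csrc" would inherit from "text/x-c" as well and
that "text/x-c" itself sits under "text/plain":

```python
# Hypothetical parent table: each mimetype maps to the type it inherits from.
# Only "text/x-chdr" -> "text/x-c" comes from the proposal; the rest is assumed.
PARENTS = {
    "text/x-chdr": "text/x-c",
    "text/x-csrc": "text/x-c",
    "text/x-c": "text/plain",
    "text/x-c++": "text/plain",
}


def is_a(mimetype, ancestor):
    """Return True if mimetype equals ancestor or inherits from it."""
    current = mimetype
    while current is not None:
        if current == ancestor:
            return True
        # Walk one step up the chain; unknown types have no parent.
        current = PARENTS.get(current)
    return False
```

With something like this, a plugin registered for "text/x-c" would also accept
a file detected as "text/x-chdr", but not the other way around.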

> It would mean, though, that a thorough mimetype detection (with magic)
> would lead to less accurate results than a fast mimetype detection (only
> with patterns).
>

Do we really need the distinction, though?
Do editors/IDEs need the mimetype to be correct in order to work correctly?
Otherwise, to the non-programming user, there is little to be gained from a 
distinction, and programmers can, for sure, find which is which.

A better solution would be to have a rule that says "a text/x-chdr is a
text/x-c whose name has an .h extension", but I don't know how hard it would
be to implement such a rule.
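
Just to show what I mean by such a rule, here is a small Python sketch. It is
only an illustration, not how KDE does (or would do) it: the rule table, the
function name refine_mimetype and the ".c" rule are my own assumptions. The
magic result is taken as the base type, and the extension is only used to
narrow it down:

```python
import os

# Hypothetical refinement rules:
# (type found by magic, file extensions, more specific type).
REFINEMENTS = [
    ("text/x-c", {".h"}, "text/x-chdr"),
    ("text/x-c", {".c"}, "text/x-csrc"),  # assumed counterpart of the .h rule
]


def refine_mimetype(magic_type, filename):
    """Narrow a content-detected type using the extension, if a rule matches."""
    ext = os.path.splitext(filename)[1]
    for base, extensions, refined in REFINEMENTS:
        if magic_type == base and ext in extensions:
            return refined
    # No rule applies: trust the content detection as it is.
    return magic_type
```

The point is that the extension never overrides the content: a file that magic
reports as plain data stays data, whatever its name says.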

Thanks for looking into this,
Luciano
-- 
Luciano Montanaro //
                \X/ mikelima at virgilio.it



