[Nepomuk] Review Request 109811: Metadata extractor for archive files (.zip, .tar.*, .ar)
Vishesh Handa
me at vhanda.in
Sun May 5 13:39:11 UTC 2013
> On May 5, 2013, 11:14 a.m., Vishesh Handa wrote:
> > services/fileindexer/indexer/archiveextractor.cpp, line 147
> > <http://git.reviewboard.kde.org/r/109811/diff/3/?file=127299#file127299line147>
> >
> > I'm not sure if this is a good idea.
> >
> > Why would you want to clear the indexed data of the zip url? It typically will not have any data stored.
>
> Denis Steckelmacher wrote:
> When I started working on this indexer, this line wasn't present. The problem was that after I launched nepomukindexer 6 or 7 times on the same Zip file (to get debug output), nepomukshow listed some duplicate properties. If I recall correctly, it was NFO::uncompressedSize that was displayed more than once.
>
> Another problem I see in this indexer is that it re-indexes the ArchiveItems on every run. Does Nepomuk have a garbage collector, or any other mechanism that prevents multiple ArchiveItems from being indexed for the same URL? During my testing, nepomukshow never returned more than one result when I asked it to display what Nepomuk knows about an archive item, but does that really mean there are no orphaned archive items?
Yeah, so you need to mark each ArchiveItem as a subresource. You can do it like this -
res.addProperty( NAO::hasSubResource(), archiveItemRes );
for each archiveItem.
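
To illustrate why this addresses the orphan question, here is a toy model that uses only the C++ standard library. It is not the Nepomuk API: ToyStore, addSubResource, removeWithSubResources and reindex are hypothetical names, and the sketch assumes that clearing a resource also removes everything marked as its subresource.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory stand-in for a resource store; NOT the Nepomuk API.
struct ToyStore {
    std::set<std::string> resources;
    // parent URL -> URLs of the resources marked as its subresources
    std::map<std::string, std::set<std::string>> subResources;

    void add(const std::string& url) { resources.insert(url); }

    // Plays the role of res.addProperty( NAO::hasSubResource(), archiveItemRes ).
    void addSubResource(const std::string& parent, const std::string& child) {
        add(parent);
        add(child);
        subResources[parent].insert(child);
    }

    // Assumption: removing a resource also removes its subresources.
    void removeWithSubResources(const std::string& url) {
        const auto it = subResources.find(url);
        if (it != subResources.end()) {
            // Copy first: the recursive calls below modify the map.
            const std::set<std::string> children = it->second;
            subResources.erase(it);
            for (const std::string& child : children) {
                removeWithSubResources(child);
            }
        }
        resources.erase(url);
    }
};

// Re-index an archive: drop the old data, then add the current items again.
// Returns the number of resources left in the store.
std::size_t reindex(ToyStore& store, const std::string& archive,
                    const std::set<std::string>& items) {
    store.removeWithSubResources(archive);
    assert(store.subResources.count(archive) == 0);
    store.add(archive);
    for (const std::string& item : items) {
        store.addSubResource(archive, item);
    }
    return store.resources.size();
}
```

In this model, indexing an archive with two items and then re-indexing it after one item has disappeared leaves no resource behind for the missing item, because the old items were removed together with their parent.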
- Vishesh
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/109811/#review32064
-----------------------------------------------------------
On April 1, 2013, 5:36 p.m., Denis Steckelmacher wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/109811/
> -----------------------------------------------------------
>
> (Updated April 1, 2013, 5:36 p.m.)
>
>
> Review request for Nepomuk.
>
>
> Description
> -------
>
> This patch adds a file metadata extractor for archive files. This extractor handles any file that can be read using KArchive.
>
> The extracted metadata are the uncompressed size of the whole archive (shown in Dolphin, but not formatted as a file size with KB or MB suffixes) and the list of files it contains. The extractor creates one Nepomuk resource per file or directory in the archive (root directory included). These resources have the types ArchiveItem, and FileDataObject (for files) or Folder (for directories). Their nie:url property is set to a URL that can be used with the Archive KIO (for instance, "zip:/home/me/archive.zip/one/file" or "tar:/usr/src/linux-3.7.2.tar.xz"). For files, fileSize is set to the uncompressed size of the file.
>
> The files themselves are neither read nor uncompressed. I haven't found a way to recursively extract metadata from archived files (for instance, by running the PlainTextExtractor on any plain-text file found in the archive).
>
>
> Diffs
> -----
>
> services/fileindexer/indexer/CMakeLists.txt 97bedfd
> services/fileindexer/indexer/archiveextractor.h PRE-CREATION
> services/fileindexer/indexer/archiveextractor.cpp PRE-CREATION
> services/fileindexer/indexer/nepomukarchiveextractor.desktop PRE-CREATION
>
> Diff: http://git.reviewboard.kde.org/r/109811/diff/
>
>
> Testing
> -------
>
> nepomukindexer seems to work. nepomukshow displays meaningful information about the indexed files: the archive itself and the files it contains. For a test archive, nepomukshow displays this information:
>
> $ nepomukshow test.zip
> <nepomuk:/res/e5eddbdb-995b-472f-9ef1-3a4ba4c9999d> # Note this ID
> rdf:type nfo:FileDataObject
> rdf:type nfo:Archive
> rdf:type nie:InformationElement
> nao:created 2013-04-01T13:57:16.586Z
> nao:lastModified 2013-04-01T13:57:17.414Z
> nie:lastModified 2013-02-28T20:49:24Z
> nie:url file:///home/steckdenis/test.zip
> nie:mimeType application/zip
> nie:created 2013-02-28T20:49:24Z
> nfo:fileSize 3368744
> nfo:uncompressedSize 4171547
> nfo:fileName test.zip
> kext:indexingLevel 2
>
> Displaying the metadata of a file contained in the archive can be done by passing a URL to nepomukshow:
>
> $ nepomukshow 'zip:/home/steckdenis/test.zip/'
> <nepomuk:/res/71458f55-898c-4374-ad00-6ac5b1d9c9e7> # Note this ID; it identifies the root directory of the archive
> rdf:type nfo:ArchiveItem
> rdf:type nfo:Folder
> rdf:type nfo:FileDataObject
> rdf:type nfo:DataContainer
> nao:created 2013-04-01T13:57:17.416Z
> nao:lastModified 2013-04-01T13:57:17.416Z
> nie:url <zip:/home/steckdenis/test.zip/>
> nie:created 1970-01-01T00:00:00Z
> nfo:belongsToContainer nepomuk:/res/e5eddbdb-995b-472f-9ef1-3a4ba4c9999d # ID of the archive file itself
>
> $ nepomukshow 'zip:/home/steckdenis/test.zip/6 My account1.png'
> <nepomuk:/res/ed73aabc-ce18-4ac7-9db7-f301ce07ffc5>
> rdf:type nfo:ArchiveItem
> rdf:type nfo:FileDataObject
> nao:created 2013-04-01T13:57:17.417Z
> nao:lastModified 2013-04-01T13:57:17.417Z
> nie:url <zip:/home/steckdenis/test.zip/6%20My%20account1.png>
> nie:created 2012-11-21T08:21:08Z
> nfo:fileSize 330923 # Uncompressed size
> nfo:belongsToContainer nepomuk:/res/71458f55-898c-4374-ad00-6ac5b1d9c9e7 # ID of the root directory
>
> When entering "6 My account1.png" in KRunner, the file is shown as an "Archive entry". When clicking on it, Gwenview is launched and displays the image.
>
>
> Thanks,
>
> Denis Steckelmacher
>
>
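
The nie:url values quoted above ("zip:/home/steckdenis/test.zip/6%20My%20account1.png", for instance) follow a simple pattern: a scheme, the path of the archive, then the path of the entry, with spaces percent-encoded. A minimal sketch of that pattern, using only the C++ standard library, could look like the code below. The function names percentEncodePath and archiveEntryUrl are hypothetical, and the patch itself most likely builds these URLs with KDE's own URL classes rather than by hand.

```cpp
#include <string>

// Percent-encode every byte except RFC 3986 unreserved characters and '/'.
std::string percentEncodePath(const std::string& path) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : path) {
        const bool keep = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') || c == '-' || c == '.' ||
                          c == '_' || c == '~' || c == '/';
        if (keep) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Build "<scheme>:<archive path>/<entry path>". An empty entry path yields
// the URL of the archive's root directory.
std::string archiveEntryUrl(const std::string& scheme,
                            const std::string& archivePath,
                            const std::string& entryPath) {
    return scheme + ":" + percentEncodePath(archivePath + "/" + entryPath);
}
```

For example, archiveEntryUrl("zip", "/home/steckdenis/test.zip", "6 My account1.png") produces the URL shown in the nepomukshow output above.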