Podcast Support

Fri Nov 20 23:55:10 CET 2009

Hi.

While working on the podcast support I ran into a few things. It's not so much
questions that I have (except for a few) as notes on how I implemented things
and some thinking out loud about the podcast support. Still, answers to the few
questions would be helpful. :)

== Multiple Enclosures ==
All podcast feed formats support multiple enclosures. Currently this is not
handled, so the 2nd overwrites the first. This is not nice. But how to handle
it? Create a 2nd podcast entry? An episode is not really a track, it's a set
of tracks, maybe a playlist itself? That would mean changing a lot. Just create
a 2nd episode that is a copy except for the file? Well, it would have the same
ID, so this might be kind of a problem. I think *usually* when there are several
enclosures in one episode they are alternative versions of the same thing
(different formats like mp3 vs. ogg vs. wma, or different mirrors, including a
bittorrent file as one of the "mirrors"). Maybe just pick the version that's
appropriate for Amarok? How do I find out which extensions are supported by Amarok?

Question: What should be done here?
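
To make the "just pick the appropriate version" idea concrete, here is a minimal
sketch in Python (for brevity, not Amarok code). The Enclosure class, the
pick_enclosure function and the preference list are all hypothetical; a real
list of supported types would have to come from whatever the playback backend
reports. When nothing matches, it simply falls back to the first enclosure,
which is an assumption and not something already decided.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Enclosure:
    url: str
    mime_type: str


# Hypothetical preference order, most preferred first. A real list would be
# derived from the formats the player can actually decode.
PREFERRED_TYPES = ("audio/ogg", "audio/mpeg", "audio/x-ms-wma")


def pick_enclosure(
    enclosures: Sequence[Enclosure],
    preferred: Sequence[str] = PREFERRED_TYPES,
) -> Optional[Enclosure]:
    """Return the enclosure whose MIME type ranks highest in `preferred`.

    Falls back to the first enclosure if none of the types is preferred,
    and returns None for an episode without enclosures.
    """
    by_type = {}
    for enc in enclosures:
        # Keep the first enclosure seen for each MIME type.
        by_type.setdefault(enc.mime_type.lower(), enc)
    for mime in preferred:
        if mime in by_type:
            return by_type[mime]
    return enclosures[0] if enclosures else None
```

With this approach an episode stays a single track, so no second episode with
the same ID is needed; the price is that the alternative versions are dropped.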

== Text Format ==
The description and other text fields of podcast feeds *might* be HTML, XHTML or
just plain text. In Atom feeds this is actually marked by the type="..."
attribute on each element containing text (can be "text", "html" or "xhtml").
Amarok does not provide a way to save this information, so I made the assumption
that the title, author etc. fields are plain text (because they are used that
way in Amarok) and that only the description field is HTML (because the code I
wrote to display it in the info applet makes that assumption and it seems not to
be used anywhere else).

So I have to convert whatever I get from the podcast feed to either plain text
or HTML. The latter is no problem, but converting HTML to plain text is. First
I'd have to strip all tags and then I'd have to resolve all the >250 predefined
HTML entities (like &auml; etc.). On top of that, when the Atom type attribute
is "html", the content is actually CDATA that makes up HTML rather than XHTML,
so I cannot parse it with an XML parser (there might be <br> instead of <br/>,
missing </p> etc.).

Is there already a function for converting HTML to plain text in Qt/KDE? I know
that Qt already has a table of all HTML entities in some internal part (and a
second time in WebKit when it is compiled in). Sadly these tables are not
accessible. If I wrote such a function myself, I would be embedding an HTML
entity table a 2nd (or 3rd) time. Kind of a waste of memory (well, not much).
It would also bloat the PodcastReader code a bit.

Question: Should I include the entity table and do the resolving and tag
stripping by myself (won't be a problem for me)?
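
For illustration, this is roughly what the tag stripping plus entity resolving
amounts to. It is a minimal sketch in Python, whose standard library happens to
ship both a forgiving HTML tokenizer and the entity table; it is not the
Qt/KDE answer I am asking about. The names _TextExtractor and html_to_text are
made up, and treating <br>, <p>, <div> and <li> as line breaks is my own
assumption. Contents of <script> or <style> elements are not filtered out here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and turns a few block-level tags into newlines."""

    # Assumed set of tags that should start a new line in the plain text.
    BLOCK_TAGS = {"br", "p", "div", "li"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &auml; in the data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Strip tags, resolve entities and normalise whitespace line by line."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

The tokenizer does not care about well-formedness, so <br> without a slash or
a missing </p> is fine, which is exactly what an XML parser would choke on.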

Another option would be to add more attributes to PodcastEpisode (and/or Track?)
that store what type the corresponding field actually has. But that would
involve changes to database tables (additional fields, which would break
existing databases?), and if fields other than the description got this, I
guess it would involve changes to lots of parts of Amarok in order to handle it
right. So I guess that's not an option at all.

Apropos: For feeds that do not support a type attribute (RSS 1.0/2.0), I found
out there is already a function in Qt to guess whether a string is (or might
be) HTML or not:
http://doc.trolltech.com/4.5/qt.html#mightBeRichText
Haven't used it yet, though.

== Fields ==
There are some fields in PodcastMetaCommon that seem not to be used and that
were not even read: summary, subtitle and I think author wasn't read either (or
was it?). I do read them from the feed. In RSS 1.0/2.0 I do some guessing about
this, because there actually is only the <description> element in the standard,
but other elements are often used. I decide what to use this way:

If only the description std element is there:
description=description

If itunes:summary is there:
summary=description, description=itunes:summary
(Hm, maybe not that good a guess on this one, but usually description is
shorter than itunes:summary.)

If itunes:summary and body are there:
subtitle=description, summary=itunes:summary, description=body

In Atom there is no guessing:
subtitle=subtitle, summary=summary, description=content
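
The rules above, written out as a minimal sketch in Python (for brevity, not
the actual PodcastReader code). The function names and the dict-based input are
hypothetical. The case where body is present without itunes:summary is not
covered by the rules above; the sketch just treats it like the description-only
case, which is an assumption.

```python
def map_rss_text_fields(item):
    """Map RSS 1.0/2.0 item elements to subtitle/summary/description.

    `item` is a dict from element name to text, containing only the
    elements that are present in the feed item.
    """
    description = item.get("description")
    itunes_summary = item.get("itunes:summary")
    body = item.get("body")

    if itunes_summary is not None and body is not None:
        return {"subtitle": description,
                "summary": itunes_summary,
                "description": body}
    if itunes_summary is not None:
        return {"subtitle": None,
                "summary": description,
                "description": itunes_summary}
    # Only <description> is there (assumption: also used when body appears
    # without itunes:summary).
    return {"subtitle": None, "summary": None, "description": description}


def map_atom_text_fields(entry):
    """Atom needs no guessing: the elements map one to one."""
    return {"subtitle": entry.get("subtitle"),
            "summary": entry.get("summary"),
            "description": entry.get("content")}
```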

However, subtitle and summary seem not to be used anywhere yet, or did I
overlook something?

You see that Atom seems to be an awesome format that already thinks about a lot
of cases that aren't covered by RSS 1.0/2.0. However, one thing it's missing is
some kind of <description> for *feeds*. The summary and content elements are for
episodes only; the feed element only has a subtitle child, so I guess users will
likely use a provided RSS feed instead (where we have to guess the content type
of the <description>).

But I have yet to find anything to put in the keywords field of
PodcastMetaCommon. Maybe <category>?

	-panzi