#74 Store language of entries as indicated by the feed

Merged
fox merged 3 commits from tkappe/pullreq-store-language into fox/master 1 year ago
tkappe commented 1 year ago

Currently, tt-rss ignores the language of entries in a feed. This can lead to awkward hyphenation when displaying the content. This patch stores the language of the entry as indicated by the feed in the database, so text can be rendered correctly.

Currently, tt-rss ignores the language of entries in a feed. This can lead to awkward hyphenation when displaying the content. This patch stores the language of the entry as indicated by the feed in the database, so text can be rendered correctly.
fox commented 1 year ago
Owner

this sounds like an edge case an absolute minority of users would ever notice, i.e. not worth adding and maintaining code to main tree.

stuff like should be handled by plugins.

this sounds like an edge case an absolute minority of users would ever notice, i.e. not worth adding and maintaining code to main tree. stuff like should be handled by plugins.
tkappe commented 1 year ago
Poster

I do not consider this an edge case, but rather an issue of correctly implementing and displaying feeds. If the feed indicates that the entry is in a certain language, then surely displaying the item in that language should be baseline functionality?

In particular for articles in Dutch, words are currently hyphenated in the middle of "ij", which makes for extremely awkward reading.

The code necessary to do this is also tiny, so I do not see the issue of maintaining.

I do not consider this an edge case, but rather an issue of correctly implementing and displaying feeds. If the feed indicates that the entry is in a certain language, then surely displaying the item in that language should be baseline functionality? In particular for articles in Dutch, words are currently hyphenated in the middle of "ij", which makes for *extremely* awkward reading. The code necessary to do this is also tiny, so I do not see the issue of maintaining.
fox commented 1 year ago
Owner

The code necessary to do this is also tiny, so I do not see the issue of maintaining.

i've looked through my feed cache and this language attribute in feeds is (as usual with feeds) all over the place, while html lang= attribute seemingly expects ISO639-1 language codes.

unless i'm missing something you haven't implemented any sanity checking, most importantly length (it's a 2 char field in the schema) so first feed specifying something like "en-US" in there is going to break feed updates.

e: the only source for this database field was af_lang_detect plugin which can only return valid language codes.

i also have a feed which has stuff like "[space][space][space]ru[space][space][space]" in there, yes, with multiple spaces.

>The code necessary to do this is also tiny, so I do not see the issue of maintaining. i've looked through my feed cache and this language attribute in feeds is (as usual with feeds) all over the place, while html lang= attribute seemingly expects ISO639-1 language codes. unless i'm missing something you haven't implemented any sanity checking, most importantly length (it's a 2 char field in the schema) so first feed specifying something like "en-US" in there is going to break feed updates. e: the only source for this database field was af_lang_detect plugin which can only return valid language codes. i also have a feed which has stuff like <code>"[space][space][space]ru[space][space][space]"</code> in there, yes, with multiple spaces.
fox commented 1 year ago
Owner

additionally, i didn't like that do-while syntax in the atom class. it's a minor personal thing tho.

anyway, at least a trim() and a mb_substr(...,0,2) is required there.

additionally, i didn't like that do-while syntax in the atom class. it's a minor personal thing tho. anyway, at least a trim() and a mb_substr(...,0,2) is required there.
fox reopened 1 year ago
tkappe commented 1 year ago
Poster

Very well, I propose to do the following:

  • Some normalization on the language value, i.e., lower case and trimming.
  • Sanity checking, i.e., check whether it is an ISO639-1 code (this should guarantee that it fits in two characters).
  • Refactor the do-while into a while loop (alternatively, I could just check the element and its direct parent, because those are the only ones involved here).

Would that be enough to have this be reconsidered?

Very well, I propose to do the following: * Some normalization on the language value, i.e., lower case and trimming. * Sanity checking, i.e., check whether it is an ISO639-1 code (this should guarantee that it fits in two characters). * Refactor the do-while into a while loop (alternatively, I could just check the element and its direct parent, because those are the only ones involved here). Would that be enough to have this be reconsidered?
fox commented 1 year ago
Owner

yes, this looks good to me.

incorrect language code is not going to break anything, so i'm not sure if checking if its a valid language is even needed, as long as it's a two character string.

i'm not against this change in general, you're right in that it's relatively minor and something that most of the plumbing is already done for already anyway. my first reaction was that it belongs with with af_lang_detect plugin, i guess.

yes, this looks good to me. incorrect language code is not going to break anything, so i'm not sure if checking if its a valid language is even needed, as long as it's a two character string. i'm not against this change in general, you're right in that it's relatively minor and something that most of the plumbing is already done for already anyway. my first reaction was that it belongs with with af_lang_detect plugin, i guess.
tkappe commented 1 year ago
Poster

OK - I implemented sanitization and removed the do/while-loop.

OK - I implemented sanitization and removed the `do/while`-loop.
fox commented 1 year ago
Owner

thanks, looks good!

thanks, looks good!
fox commented 1 year ago
Owner

Uncaught Error: Call to undefined method DOMProcessingInstruction::getAttributeNS() in /home/fox/public_html/tt-rss/classes/feeditem/atom.php:208 Stack trace: #0 /home/fox/public_html/tt-rss/classes/rssutils.php(640): FeedItem_Atom->get_language() #1 /home/fox/public_html/tt-rss/classes/rssutils.php(190): RSSUtils::update_rss_feed(8186, true, false) #2 /home/fox/public_html/tt-rss/update.php(199): RSSUtils::update_daemon_common(50) #3 {main} thrown

uhhhh

feed in question is "https://www.gocomics.com/peanuts" which is handled by af_comics plugin (i think it generates XML on fetch), maybe firstChild is null or something.

e: it's not null it's just not a DOMElement i guess.

another feed is a blogspot blog: http://crpgaddict.blogspot.com/feeds/posts/default

>Uncaught Error: Call to undefined method DOMProcessingInstruction::getAttributeNS() in /home/fox/public_html/tt-rss/classes/feeditem/atom.php:208 Stack trace: #0 /home/fox/public_html/tt-rss/classes/rssutils.php(640): FeedItem_Atom->get_language() #1 /home/fox/public_html/tt-rss/classes/rssutils.php(190): RSSUtils::update_rss_feed(8186, true, false) #2 /home/fox/public_html/tt-rss/update.php(199): RSSUtils::update_daemon_common(50) #3 {main} thrown uhhhh feed in question is "https://www.gocomics.com/peanuts" which is handled by af_comics plugin (i think it generates XML on fetch), maybe firstChild is null or something. e: it's not null it's just not a DOMElement i guess. another feed is a blogspot blog: http://crpgaddict.blogspot.com/feeds/posts/default
fox commented 1 year ago
Owner

yeah it's a stylesheet

object(DOMProcessingInstruction)#43 (18) {
  ["target"]=>
  string(14) "xml-stylesheet"
  ["data"]=>
  string(39) "type="text/xsl" href="atom-to-html.xsl""
  ["nodeName"]=>
  string(14) "xml-stylesheet"
  ["nodeValue"]=>
  string(39) "type="text/xsl" href="atom-to-html.xsl""

assuming first entry is a DOMElement is probably not the best idea

yeah it's a stylesheet <pre> object(DOMProcessingInstruction)#43 (18) { ["target"]=> string(14) "xml-stylesheet" ["data"]=> string(39) "type="text/xsl" href="atom-to-html.xsl"" ["nodeName"]=> string(14) "xml-stylesheet" ["nodeValue"]=> string(39) "type="text/xsl" href="atom-to-html.xsl"" </pre> assuming first entry is a DOMElement is probably not the best idea
tkappe commented 1 year ago
Poster

Right - I should have thought of that. I see you have already added a fix. I put an alternative fix on my branch which determines the feed element by simply looking at the parent of the entry element.

Right - I should have thought of that. I see you have already added a fix. I put an alternative fix on my branch which determines the feed element by simply looking at the parent of the entry element.
This pull request has been merged successfully!
Sign in to join this conversation.
No Label
No Milestone
No assignee
2 Participants
Loading...
Cancel
Save
There is no content yet.