# Readability.php

## News (August 2021)

Andres Rey, the [original developer](https://github.com/andreskrey/readability.php) of Readability.php, has kindly let us take over maintenance and development of the project. Please bear with us while we catch up with [Readability.js](https://github.com/mozilla/readability) changes. There'll be a new release (3.0.0) when we're ready.

For the changes we've made so far in this repository, please see our [blog post](https://www.fivefilters.org/2021/readability/).

## About

[![Latest Stable Version](https://poser.pugx.org/fivefilters/readability.php/v/stable)](https://packagist.org/packages/fivefilters/readability.php) [![Tests](https://github.com/fivefilters/readability.php/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/fivefilters/readability.php/actions/workflows/main.yml)

PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses HTML text (usually news and other articles) and returns the **title**, **author**, **main image** and **text content** without nav bars, ads, footers, or anything that isn't the main body of the text. It analyzes each node, gives it a score, and determines what's relevant and what can be discarded.

![Screenshot](https://raw.githubusercontent.com/fivefilters/readability.php/assets/screenshot.png)

The project aims to be a 1:1 port of Mozilla's version and to closely follow all changes introduced there, but there are some major differences in the structure. Most of the code is a 1:1 copy (even the comments were imported), but some functions and structures were adapted to better suit PHP.

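To give a rough feel for what "gives it a score" means, here is a toy heuristic in which longer text with more commas scores higher. This is only a sketch: the function name `scoreParagraphText`, the 25-character cutoff, and the weights are illustrative assumptions, not the library's actual scoring code.

```php
<?php
// Hypothetical helper: scores a block of text by how much it looks like article prose.
function scoreParagraphText(string $text): int
{
    $text = trim($text);
    if (strlen($text) < 25) {
        return 0; // too short to look like article prose
    }

    $score = 1;                                   // base point for being a candidate
    $score += substr_count($text, ',');           // commas hint at real sentences
    $score += min(intdiv(strlen($text), 100), 3); // up to 3 extra points for length

    return $score;
}

$samples = [
    'nav'  => 'Home | About | Contact',
    'body' => str_repeat('Readers want the story, not the chrome, and they want it fast. ', 4),
];

foreach ($samples as $label => $text) {
    echo $label, ': ', scoreParagraphText($text), PHP_EOL;
}
```

With a heuristic like this, navigation snippets tend to score low and body paragraphs high, which is the kind of signal used to decide what to keep and what to discard.
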
**Original Developer**: Andres Rey

**Developer/Maintainer**: FiveFilters.org

## Code porting

Master branch - Up to date on 26 August 2021, with the exception of a [piece of code](https://github.com/fivefilters/readability.php/commit/1c662465bded2ab3acf3b975a1315c8c45f0bf73#diff-b9b31807b1a39caec18ddc293e9c52931ba8b55191c61e6b77a623d699a599ffR1899) which doesn't produce the same results in PHP for us compared to the JS version. Perhaps there's an error, or some difference in the underlying code that affects this. If you know what's wrong, please feel free to drop us a note or submit a pull request. :)

Version 2.1.0 - Up to date with Readability.js up to [19 Nov 2018](https://github.com/mozilla/readability/commit/876c81f710711ba2afb36dd83889d4c5b4fc2743).

## Requirements

PHP 7.3+, ext-dom, ext-xml, and ext-mbstring. To install these dependencies (in the rare case your system does not have them already), you could try something like this in *nix-like environments:

`$ sudo apt-get install php7.4-xml php7.4-mbstring`

## How to use it

First, require the library using Composer:

`composer require fivefilters/readability.php`

Then create a `Readability` object and pass it a `Configuration` object, feed the `parse()` method your HTML, and echo the object:

```php
use fivefilters\Readability\Readability;
use fivefilters\Readability\Configuration;
use fivefilters\Readability\ParseException;

$readability = new Readability(new Configuration());

$html = file_get_contents('http://your.favorite.newspaper/article.html');

try {
    $readability->parse($html);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}
```

Your script will output the parsed text or report any errors. You should always wrap the `->parse` call in a try/catch block, because a `ParseException` is thrown if the HTML cannot be parsed correctly.

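If the example above fails before it even reaches `parse()`, it may be worth confirming the requirements listed earlier (PHP 7.3+, ext-dom, ext-xml, ext-mbstring). The following is a minimal sketch using only PHP's built-in `version_compare()` and `extension_loaded()`. The helper name `missingRequirements` and its injectable `$isLoaded` parameter are hypothetical and not part of Readability.php.

```php
<?php
/**
 * Hypothetical helper: returns a list of unmet requirements.
 * An empty array means everything listed in the Requirements section is available.
 * $isLoaded can be swapped out so the check is testable without changing the runtime.
 */
function missingRequirements(string $phpVersion = PHP_VERSION, ?callable $isLoaded = null): array
{
    $isLoaded = $isLoaded ?? 'extension_loaded';
    $problems = [];

    if (version_compare($phpVersion, '7.3.0', '<')) {
        $problems[] = "PHP 7.3+ is required, found {$phpVersion}";
    }

    foreach (['dom', 'xml', 'mbstring'] as $extension) {
        if (!$isLoaded($extension)) {
            $problems[] = "Missing PHP extension: ext-{$extension}";
        }
    }

    return $problems;
}

// Report anything missing in the current runtime.
foreach (missingRequirements() as $problem) {
    echo $problem, PHP_EOL;
}
```

Running this once in the same environment as your script (CLI or web server) shows whether a missing extension is the cause, since the two can load different `php.ini` files.
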
If you want finer control over the output, call the getters one by one and wrap the results in your own HTML.

```php
<h1><?= $readability->getTitle(); ?></h1>
<h2>By <?= $readability->getAuthor(); ?></h2>
<div class="content"><?= $readability->getContent(); ?></div>
```

Here's a list of the available getters:

- Article title: `->getTitle();`
- Article content: `->getContent();`
- Excerpt: `->getExcerpt();`
- Main image: `->getImage();`
- All images: `->getImages();`
- Author: `->getAuthor();`
- Text direction (ltr or rtl): `->getDirection();`

If you need to tweak the final HTML, you can get the DOMDocument of the result by calling `->getDOMDocument()`.

## Options

You can change the behaviour of Readability via the Configuration object. For example, if you want to fix relative URLs and declare the original URL, you could set up the configuration like this:

```php
$configuration = new Configuration();
$configuration
    ->setFixRelativeURLs(true)
    ->setOriginalURL('http://my.newspaper.url/article/something-interesting-to-read.html');
```

You can also pass an array of configuration parameters to the constructor:

```php
$configuration = new Configuration([
    'fixRelativeURLs' => true,
    'originalURL'     => 'http://my.newspaper.url/article/something-interesting-to-read.html',
    // other parameters ... listing below
]);
```

Then pass this Configuration object to Readability. The following options are available. Remember to prepend `set` when calling them using native setters.

- **MaxTopCandidates**: default value `5`, maximum number of top-level candidates.
- **CharThreshold**: default value `500`, minimum number of characters for the article to be considered successfully parsed.
- **ArticleByLine**: default value `false`, search for the article byline and remove it from the text. It will be moved to the article metadata.
- **StripUnlikelyCandidates**: default value `true`, remove nodes that are unlikely to have relevant information. Useful for debugging or parsing complex or non-standard articles.
- **CleanConditionally**: default value `true`, remove certain nodes after parsing to return a cleaner result.
- **WeightClasses**: default value `true`, weight classes during the rating phase.
- **FixRelativeURLs**: default value `false`, convert relative URLs to absolute. Like `/test` to `http://host/test`.
- **SubstituteEntities**: default value `false`, disables the `substituteEntities` flag of libxml. Will avoid substituting HTML entities. Like `&aacute;` to á.
- **NormalizeEntities**: default value `false`, converts UTF-8 characters to their HTML entity equivalents. Useful to parse HTML with mixed encoding.
- **OriginalURL**: default value `http://fakehost`, original URL of the article, used to fix relative URLs.
- **KeepClasses**: default value `false`, which removes all `class="..."` attribute values from HTML elements.
- **Parser**: default value `html5`, which uses HTML5-PHP for parsing. Set to `libxml` to use that instead (not recommended for modern HTML documents).
- **SummonCthulhu**: default value `false`, remove all `<script>` tags via regex.

If you would like to remove the scripts from the HTML (as Readability does), you would expect to end up with just one div and one comment in the final HTML. The problem is that libxml treats the closing div tag inside the JavaScript string as an HTML tag, effectively closing the unclosed tag and leaving the rest of the JavaScript as a string within a P tag. If you save that node, the final HTML will end up like this:

```html
<div>
<p>'; // I should not appear on the result</p>
</div>
```

This is a libxml issue and not a Readability.php bug.

There's a workaround for this: using the `summonCthulhu` option. This will remove all script tags **via regex**, which is not ideal because you may end up summoning [the lord of darkness](https://stackoverflow.com/a/1732454).

### `&nbsp;` entities disappearing

`&nbsp;` entities are converted to spaces automatically by libxml and there's no way to disable it.

### Self closing tags rendering as fully expanded tags

Self closing tags like `
` get automatically expanded to `