How to use SimpleXML tools to extract a URL from a subnode, when subnode is unknown? - PHP

simplexml xpath

David

**tl;dr** - how to search any sub-node for a particular string in order to extract the URL contained there?

---

I have a script that parses the [OPDS feed][o] for [Standard Ebooks][se], found at https://standardebooks.org/opds/all. I can get basically everything I need, and it produces [a simple cumulative index][ci]: fairly rough-and-ready, but that's what I'm after. So far, so good.

Sometimes, books appear in a series, however, and this is invariably noted and linked in the (long) "content" description of the ebook. **I would like to be able to extract the URL for the series, where such occurs.** The first example in the current listing is Agatha Christie's *The Murder on the Links*. The first paragraph of the long description has this:

> *The Murder on the Links* is [Agatha Christie’s](https://standardebooks.org/ebooks/agatha-christie) second [Poirot](https://standardebooks.org/collections/poirot) novel, featuring the brilliant Belgian detective and his sidekick, Captain Hastings.

The raw HTML:

```html
<p><i>The Murder on the Links</i> is <a href="https://standardebooks.org/ebooks/agatha-christie">Agatha Christie’s</a> second <a href="https://standardebooks.org/collections/poirot">Poirot</a> novel, featuring the brilliant Belgian detective and his sidekick, Captain Hastings.</p>
```

In the OPDS feed:

```xml
<content type="text/html">
  <p>
  <i>The Murder on the Links</i>
  is
  <a href="https://standardebooks.org/ebooks/agatha-christie">Agatha Christie’s</a>
  second
  <a href="https://standardebooks.org/collections/poirot">Poirot</a>
  novel, featuring the brilliant Belgian detective and his sidekick, Captain Hastings.
  </p>
  ...
</content>
```

And when parsed by PHP's SimpleXML, it gives this (extract only!):

```php
SimpleXMLElement Object
(
    [content] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [type] => text/html
                )
            [p] => Array
                (
                    [0] => SimpleXMLElement Object
                        (
                            [i] => The Murder on the Links
                            [a] => Array
                                (
                                    [0] => SimpleXMLElement Object
                                        (
                                            [@attributes] => Array
                                                (
                                                    [href] => https://standardebooks.org/ebooks/agatha-christie
                                                )
                                            [0] => Agatha Christie’s
                                        )
                                    [1] => SimpleXMLElement Object
                                        (
                                            [@attributes] => Array
                                                (
                                                    [href] => https://standardebooks.org/collections/poirot
                                                )
                                            [0] => Poirot
                                        )
                                )
                        )
                    [1] => In this characteristic whodunit, Poirot is summoned to a seaside town in northern France by a desperate letter from a rich businessman, who fears that he is being stalked. Poirot arrives to find the businessman already dead, his body lying facedown in an open grave on a golf course, a knife in his back—the victim of a mysterious murder. Over the coming days Poirot clashes wits with an arrogant Parisian detective, Giraud, while Hastings finds himself pining after a beautiful but shadowy American expatriate known to him only as “Cinderella.” Together, Poirot and Hastings unravel the intricate web of mystery and deceit behind the murder.
                    [2] => Christie based this mystery after a real-life French murder case, and it’s believed that this is the first detective novel to use the phrase “the scene of the crime.”
                )
        )
)
```

I can pull out the URL for the "collection" when I know where it is, with something like this:  
`$content = $list[$i]->content->p[0]->a[1]['href'];`, where `$list[$i]` identifies the particular ebook "node" in a `foreach` loop: it produces the "SimpleXMLElement Object" extract above. (In the PHP docs it would often be represented by `$xml`.)

But what I need to be able to do is search anywhere in the `content` node for a paragraph that has the text `standardebooks.org/collections` (or whatever is the most sensible string for this search), to then get the full URL for the collection.

I have tried things with `xpath()`, using `//` to search any sub-node, but I have not been able to work out the correct syntax for the search.

Any help with this would be greatly appreciated! (I can provide the complete "parsed" XML for the feed, if that would be a help.)

[o]: https://specs.opds.io/opds-1.2.html
[se]: https://standardebooks.org/
[ci]: https://www.sudalyph.org/seci/

Top Answer

Jack Douglas

You can extract the collection(s) using XPath like this:

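```php
$xml = simplexml_load_file('https://standardebooks.org/opds/all');

foreach ($xml->entry as $entry) {
  // The feed's elements are in the Atom namespace, so bind it to a prefix for XPath.
  $entry->registerXPathNamespace('ns', 'http://www.w3.org/2005/Atom');
  // Match every link under <content> whose href starts with the collections URL.
  foreach ($entry->xpath("ns:content//ns:a[starts-with(@href,'https://standardebooks.org/collections')]") as $n => $a) {
    // Print the title once, before the first collection link of this entry.
    if ($n === 0) echo PHP_EOL.$entry->title.PHP_EOL;
    echo $a['href'].PHP_EOL;
  }
}
```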

which produces:

```none

The Murder on the Links
https://standardebooks.org/collections/poirot

The Mysterious Affair at Styles
https://standardebooks.org/collections/poirot

Voodoo Planet
https://standardebooks.org/collections/solar-queen

…
```

but note that you may get more than one collection, for example:

```none
The Lerouge Case
https://standardebooks.org/collections/monsieur-lecoq
https://standardebooks.org/collections/sherlock-holmes
```

You haven't specified what you will do in that case, so I've left them all in.
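
If it is easier to work with the results as data rather than printed lines, the same query can fill an array keyed by title. This is only a sketch: the `$sample` XML is a made-up stand-in shaped like one feed entry (the link text and sentence are invented, only the URLs come from the output above), and `collectionUrls()` is a hypothetical helper name, not part of SimpleXML.

```php
<?php
// Hypothetical sample shaped like one feed entry; not real feed data.
$sample = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>The Lerouge Case</title>
    <content type="text/html">
      <p>Linked to <a href="https://standardebooks.org/collections/monsieur-lecoq">one series</a>
      and <a href="https://standardebooks.org/collections/sherlock-holmes">another</a>.</p>
    </content>
  </entry>
</feed>
XML;

// Hypothetical helper: returns every distinct collection URL found in one entry.
function collectionUrls(SimpleXMLElement $entry): array
{
    $entry->registerXPathNamespace('ns', 'http://www.w3.org/2005/Atom');
    $links = $entry->xpath(
        "ns:content//ns:a[starts-with(@href,'https://standardebooks.org/collections')]"
    );

    $urls = [];
    // xpath() can return false or null on error, so fall back to an empty list.
    foreach ($links ?: [] as $a) {
        $urls[] = (string) $a['href']; // cast the attribute object to a plain string
    }
    return array_values(array_unique($urls));
}

$xml = simplexml_load_string($sample);

$byTitle = [];
foreach ($xml->entry as $entry) {
    $byTitle[(string) $entry->title] = collectionUrls($entry);
}

print_r($byTitle);
```

With the URLs in an array, keeping only the first collection is a matter of reading `$byTitle[$title][0] ?? null`, and entries without a series simply get an empty array.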

Notes on the code:

* You need `registerXPathNamespace` because "[To run an xpath query on an XML document that has a namespace, the namespace must be registered…](https://www.php.net/manual/en/simplexmlelement.xpath.php#115957)".
* The XPath `ns:content//ns:a[starts-with(@href,'https://standardebooks.org/collections')]` first finds the `content` nodes and then any descendant (`//`) `a` node with an `href` attribute (`@href`) that starts with the specified prefix.
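
Both notes can be tried without fetching the feed. The sketch below assumes a small inline fragment (`$sample`, a shortened stand-in for the `content` extract in the question, placed in the Atom namespace as the feed is) and compares an unprefixed query, the `starts-with()` query from above, and a `contains()` variant that matches the substring mentioned in the question.

```php
<?php
// Hypothetical fragment shaped like the <content> extract in the question.
$sample = <<<'XML'
<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="text/html">
    <p><a href="https://standardebooks.org/ebooks/agatha-christie">Agatha Christie’s</a>
    second <a href="https://standardebooks.org/collections/poirot">Poirot</a> novel.</p>
  </content>
</entry>
XML;

$entry = simplexml_load_string($sample);

// Unprefixed names in XPath 1.0 only match elements in no namespace,
// so this query finds nothing in a namespaced document.
$unprefixed = $entry->xpath("content//a");

$entry->registerXPathNamespace('ns', 'http://www.w3.org/2005/Atom');

// Prefix test: the href must begin with the collections URL.
$byPrefix = $entry->xpath(
    "ns:content//ns:a[starts-with(@href,'https://standardebooks.org/collections')]"
);

// Substring test: the href only has to contain the given text.
$bySubstring = $entry->xpath(
    "ns:content//ns:a[contains(@href,'standardebooks.org/collections')]"
);

echo 'unprefixed matches: ', count($unprefixed ?: []), PHP_EOL;
echo 'starts-with: ', (string) $byPrefix[0]['href'], PHP_EOL;
echo 'contains: ', (string) $bySubstring[0]['href'], PHP_EOL;
```

`starts-with()` is the stricter test; `contains()` would also match a link that merely mentions the collections path somewhere inside a longer URL, which may or may not be what you want.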
