PeterVandivier
I'm trying to extract a table element from the online docs page at https://www.openssl.org/docs/man1.1.1/man1/ [^1].
I've extracted the XPath to the element from chrome developer tools... `//*[@id="content"]/div/article/div/table`
![Screenshot 2020-09-08 at 09.44.23.png](/image?hash=6e903bfd979e9e931938ee086192d6d135948ecfd6e432ca4364440577a59e82)
And I started by dumping the file to disk.
```bash
curl -s https://www.openssl.org/docs/man1.1.1/man1/ > index.html
```
Sadly, there's a JavaScript command and a few control characters that aren't playing nice with [xpath](https://www.unix.com/man-page/osx/1p/xpath/)
> not well-formed (invalid token) at line 19, column 26, byte 702:
>
> undefined entity at line 184, column 8, byte 12584:
>
> undefined entity at line 217, column 14, byte 13369:
I'm lazy and `sed` is a thing, so I stripped these out.
```bash
curl -s https://www.openssl.org/docs/man1.1.1/man1/ | \
sed 's/&&//' | \
sed -E 's/(&[a-z]{4});//' \
> index.html
```
Sadly however, I'm still getting the following error when running `xpath index.html '//*[@id="content"]/div/article/div/table'`
> mismatched tag at line 223, column 2, byte 13451:
>
> </body>
> </html>
> =^
> at /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level/XML/Parser.pm line 187.
Which makes it look like the top-level `<html>` element is malformed. For the life of me I can't figured out where the error is.
For what it's worth, I get the same error when removing the troublesome elements manually instead of using `sed`.
Is there something I can do to extract this element properly instead of resorting to [string parsing my html](https://stackoverflow.com/a/1732454/4709762)?
---
[^1]: Never mind why, really. this is _very much_ several levels down on an X-Y problem where I finally have _a single_ atomic issue I can ask about.
Top Answer
wizzwizz4
Your page isn't valid XHTML. I'll just snip out the middle to make it easier to see:
```html
<!DOCTYPE html>
<html lang="en">
<!-- head.inc -->
<title>
/docs/man1.1.1/man1/index.html
</title>
<meta charset="utf-8">
*snip*
<link href="//fonts.googleapis.com/css?family=PT+Serif:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
<link href="//fonts.googleapis.com/css?family=PT+Sans:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
<!--[if lt IE 9]>
<script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!-- end -->
*snip*
<body>
*snip*
</body>
</html>
```
Notice anything? Here's another hint: the XML parser was looking for a `</link>`.
Yes: [SGML](https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language)'s self-closing tags need to end with a `/>`, not a `>`, to be valid XML. You're lucky that the (stuff that should be in) `<head>` is the only issue; if you strip everything from `<!-- head.inc -->` to the first `<!-- end -->` before running it through the XML parser, it should work.