I'm trying to extract a table element from the online docs page at [^1]. 

I've copied the XPath to the element from Chrome Developer Tools: `//*[@id="content"]/div/article/div/table`

![Screenshot 2020-09-08 at 09.44.23.png](/image?hash=6e903bfd979e9e931938ee086192d6d135948ecfd6e432ca4364440577a59e82)

And I started by dumping the file to disk.

curl -s > index.html

Sadly, there's a JavaScript command and a few control characters that aren't playing nice with `xpath`:

> not well-formed (invalid token) at line 19, column 26, byte 702:
> undefined entity at line 184, column 8, byte 12584:
> undefined entity at line 217, column 14, byte 13369:

I'm lazy and `sed` is a thing, so I stripped these out.

curl -s | \
    sed 's/&&//' | \
    sed -E 's/(&[a-z]{4});//' \
  > index.html

Sadly, however, I'm still getting the following error when running `xpath index.html '//*[@id="content"]/div/article/div/table'`:

> mismatched tag at line 223, column 2, byte 13451:
> </body>
> </html>
> =^
>  at /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level/XML/ line 187.

Which makes it look like the top-level `<html>` element is malformed. For the life of me I can't figure out where the error is.

For what it's worth, I get the same error when removing the troublesome elements manually instead of using `sed`. 

Is there something I can do to extract this element properly instead of resorting to string parsing my HTML?


[^1]: Never mind why, really. This is _very much_ several levels down on an X-Y problem where I finally have _a single_ atomic issue I can ask about.
Top Answer
Your page isn't valid XHTML. I'll just snip out the middle to make it easier to see:

<!DOCTYPE html>
<html lang="en">
<!-- -->
  <meta charset="utf-8">


  <link href="//,italic,bold,bolditalic" rel="stylesheet" type="text/css">
  <link href="//,italic,bold,bolditalic" rel="stylesheet" type="text/css">

  <!--[if lt IE 9]>
    <script src=""></script>
<!-- end -->

Notice anything? Here's another hint: the XML parser was looking for a `</link>`.

Yes: SGML's self-closing tags need to end with `/>`, not `>`, to be valid XML. You're lucky that the (stuff that should be in) `<head>` is the only issue; if you strip everything from `<!-- -->` to the first `<!-- end -->` before running it through the XML parser, it should work.
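
If you'd rather repair the tags than strip the block, something like the following might do. It is only a rough sketch: the function name `close_void_tags` is made up for illustration, the tag list (`meta`, `link`) is an assumption based on the snippet above, and it only handles tags that sit on a single line and carry at least one attribute. It does nothing about the other problems (entities, the conditional comment), so treat it as a starting point rather than a full fix.

```shell
# Hypothetical helper: rewrite <meta ...> and <link ...> as <meta .../> and
# <link .../> so an XML parser stops waiting for a closing tag.
# Tags that already end in "/>" are left untouched.
close_void_tags() {
  sed -E 's#<(meta|link)([^>]*[^/>])>#<\1\2/>#g'
}

# Small demonstration on made-up input (a.css is a placeholder name)
printf '%s\n' \
  '<meta charset="utf-8">' \
  '<link href="a.css" rel="stylesheet" type="text/css">' |
  close_void_tags
```

You would then run it as `close_void_tags < index.html > fixed.html` and point `xpath` at `fixed.html`. Extend the `(meta|link)` alternation if your page has other unclosed tags such as `br` or `img`.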
