Anonymous 1657
Suppose I send an e-mail to somebody. Then they respond, quoting my text in some wildly varying manner, with their own text added at the top, at the bottom, or even in the middle, between two or more "chunks" of my quoted text.

Now, I'm trying to automatically extract this new content, stripping away all the other stuff. I want it to work "in the wild", in the real world, and not just in some theoretical utopia where everyone sends plaintext e-mails, uses `>` quotes, and does so consistently according to formal Usenet rules written by academics in 1985.

It strikes me that it perhaps would be possible to "diff" the previous e-mail in the "conversation", if known, with the new one, and extract only the difference. But I'm not sure that's reliable or the best way. I also don't know how exactly I would do that.

It seems to me that there should exist some sort of library or something to do this, just like the "MailMimeParser" one which is able to extract the plaintext and HTML parts from a scary e-mail "binary blob" which has to work with about 2,500 different RFCs and insane, convoluted rules due to how e-mail has evolved over the decades. But sadly, that library doesn't do anything about the actual contents; it only does the technical part of giving me a proper plaintext and/or HTML version of the e-mail, depending on what the person (or machine) sent to me.

How would you solve this problem? Isn't it a very common one? One example of a use for this would be to have a parser automatically interpret the response from a human who quotes a large text only to reply "OK", "Yes", "sure", "okay", etc. Instead of looking for those phrases in the entire e-mail message, which would be very unreliable because they could also appear anywhere in the quoted text, I want to look for them only in the "new" content, that is, not inside any kind of "quote blocks".
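To make that use case concrete, here is a minimal Python sketch of the naive version: drop lines that look quoted, then check whether what remains is just an affirmative word. Everything in it is an assumption made for illustration: the function names `strip_quotes` and `is_affirmative`, the word list, and the three patterns (`>` prefixes, an English "On ... wrote:" attribution line, and an "-----Original Message-----" separator). Real clients produce many more variants, including localized attribution lines and HTML quoting, so this is a starting point rather than a solution.

```python
import re

# Heuristic patterns; these cover only a few common English plaintext styles.
QUOTE_LINE = re.compile(r"^\s*>")
ATTRIBUTION = re.compile(r"^\s*On .+ wrote:\s*$")
SEPARATOR = re.compile(r"^\s*-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE)

# Hypothetical list of words that count as agreement.
AFFIRMATIVE = {"ok", "okay", "yes", "sure"}


def strip_quotes(body: str) -> str:
    """Return the body without lines that look like quoted material."""
    kept = []
    for line in body.splitlines():
        if SEPARATOR.match(line):
            break  # treat everything below the separator as the quoted original
        if QUOTE_LINE.match(line) or ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def is_affirmative(body: str) -> bool:
    """True if the unquoted text consists only of affirmative words."""
    words = re.findall(r"[a-z]+", strip_quotes(body).lower())
    return bool(words) and all(word in AFFIRMATIVE for word in words)
```

With this, a reply consisting of "OK" followed by a quoted block that itself contains "yes" or "sure" is classified by the "OK" alone. A signature or a "Sent from my phone" footer would defeat the check, since those lines are neither quoted nor affirmative.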

A "simple diff from the last e-mail" probably would not be reliable, for many reasons, and even then, I've never figured out a proper way to even do a "diff" between two strings.
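For reference, a line-level "diff" between two strings can be sketched with Python's standard `difflib` module. The sketch below assumes both messages are already available as plaintext, and it strips leading `>` markers before comparing so that a quoted line can match its original. The helper names `normalize` and `added_lines` are made up for this example.

```python
import difflib
import re


def normalize(line: str) -> str:
    """Drop leading '>' quote markers and collapse runs of whitespace."""
    line = re.sub(r"^(\s*>)+\s?", "", line)
    return " ".join(line.split())


def added_lines(previous: str, reply: str) -> list[str]:
    """Return the non-blank lines of `reply` that have no counterpart in `previous`."""
    old = [normalize(line) for line in previous.splitlines()]
    new_raw = reply.splitlines()
    new = [normalize(line) for line in new_raw]

    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    result = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' mark stretches of the reply that are not in the original.
        if tag in ("insert", "replace"):
            result.extend(line for line in new_raw[j1:j2] if line.strip())
    return result


previous = "Can you come on Friday?\nLet me know."
reply = "> Can you come on Friday?\nYes.\n> Let me know.\nWill do."
print(added_lines(previous, reply))  # ['Yes.', 'Will do.']
```

This handles top, bottom, and interleaved replies the same way, but it compares whole lines. If the replying client re-wraps the quoted text at a different width, the lines no longer match and get reported as new, and attribution lines such as "On ... wrote:" also show up as new content.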

I'm not sure how many different ways (in practice) that people/e-mail clients make quotes, but it doesn't seem entirely unsolvable. I'm willing to bet that this is implemented quite robustly at many companies and in many automated systems, but the question is: How do *I* or *we* do it, without having to pay money for some "enterprise" product?

I'm not a popular person, so I don't have a rich archive of received e-mails with quotes, and thus I'm unsure how many different ways of quoting there are. It's thus very difficult for me to try to come up with a solution on my own. Naturally, it will never be perfect, and some people will inevitably do something incredibly stupid to break this mechanism, but as long as it works "reasonably" well, I would be content.

Of course, there's also the problem that they might include quotes from some third-party source in their reply... Then, any solution which strips away all quotes without checking them against the contents of previous e-mails would remove that quote too, even though it was intended as part of the reply rather than "quoting what the other party said"...
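One way to picture that check, as a rough Python sketch: only discard a `>` line when its text actually occurs in the previous message, and keep quoted lines that come from somewhere else. The name `keep_unseen_quotes` and the whitespace- and case-insensitive substring comparison are assumptions chosen for illustration. A very short quoted fragment could match the earlier message by accident.

```python
import re

QUOTE_PREFIX = re.compile(r"^(\s*>)+\s?")


def _norm(text: str) -> str:
    """Collapse whitespace and lowercase, so re-wrapped quotes still match."""
    return " ".join(text.split()).lower()


def keep_unseen_quotes(previous: str, reply: str) -> list[str]:
    """Return reply lines, dropping only quoted lines found in `previous`."""
    seen = _norm(previous)
    kept = []
    for line in reply.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if QUOTE_PREFIX.match(line):
            content = _norm(QUOTE_PREFIX.sub("", line))
            if not content or content in seen:
                continue  # empty quote line, or quoted from the earlier message
        kept.append(line)
    return kept


previous = "Shall we ship on Monday?\nThe build is green."
reply = (
    "> Shall we ship on Monday?\n"
    "Yes. As the manual says:\n"
    "> Never deploy on a Friday.\n"
    "So Monday is fine."
)
for line in keep_unseen_quotes(previous, reply):
    print(line)
```

In this example the quoted question from the earlier message is dropped, while the quoted line from the manual survives because it never appeared in that message.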
