NUTCH-1749 Optionally exclude title from content field#285
NUTCH-1749 Optionally exclude title from content field#285steeveb972 wants to merge 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
@steeveb972 Thanks for your contribution!
I've added some comments about your implementation that we should address before merging. Also, we need to fix the conflicts with the master branch.
| <!-- target: ant-eclipse-download =================================== --> | ||
| <target name="ant-eclipse-download" description="--> downloads the ant-eclipse binary."> | ||
| <get src="http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2" | ||
| <get src="http://freefr.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2" |
There was a problem hiding this comment.
I don't see a good reason to change this URL.
| * append all the content text found beneath the DOM node to the | ||
| * <code>StringBuffer</code>. | ||
| * | ||
| * <code>StringBuffer</code> without the mentioned element names in the <code>Set</code>. |
There was a problem hiding this comment.
We return the textual content, not the element names. It should be something:
without the text from the excluded elements
| boolean abortOnNestedAnchors) { | ||
| if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) { | ||
| private boolean getText(StringBuffer sb, Node node, | ||
| boolean abortOnNestedAnchors, Set<String> excludedElementNames) { |
| boolean abortOnNestedAnchors, int anchorDepth, Set<String> excludedElementNames) { | ||
| boolean abort = false; | ||
| NodeWalker walker = new NodeWalker(node); | ||
| Set<String> lcExcludedElementNames = new HashSet<>(); |
There was a problem hiding this comment.
We should avoid duplicating the exclusion set. This method is executed many times. We could use a TreeSet when it is invoked, delegating the comparison to the Set implementation.
| * append all the content text found beneath the DOM node to the | ||
| * <code>StringBuffer</code>. | ||
| * | ||
| * <code>StringBuffer</code> without the mentioned element names in the <code>Set</code>. |
|
Hi @steeveb972 this seems like a really useful patch. Are you able to update it and we try to get it into master? |
|
Hi, it has been a while ;-) I will have a look at it this week-end |
Hello,
Following the description and the comment provided in this jira, I propose a patch to add the ability to exclude some tags in HTML and Tika parsers.
Regards,
Steeve