
xml embed references. halp! :(


upb
November 8th, 2007, 01:00
Hi.

Could anyone please offer any tips/pointers/ideas on how to optimize the following:

For each element in the document that has an href attribute, no child elements, and no text child nodes,
find the element whose id equals the value of that href attribute, move all of its child elements and attributes (except id) to the referencing element, and remove the referenced element.

The current implementation takes 0,09 sec per root-level detached element on a 10MB xml document, which amounts to about 3 minutes in total.
That is bearable, but the situation is hopeless with more elements...

Code:
protected static void EmbedRefs(ref XmlNode n, ref XmlNode container)
{
    // Referrers: child elements that carry an href and have no text content.
    XmlNodeList detachedChildren = n.SelectNodes("child::*[@href and not(text())]");
    foreach (XmlNode child in detachedChildren)
    {
        // Strip the leading '#' from the href to get the target id.
        string href = child.Attributes["href"].Value.Substring(1);
        XmlNode referee = container.SelectSingleNode("child::*[@id='" + href + "']");

        // Resolve any references inside the referee first.
        EmbedRefs(ref referee, ref container);

        child.Attributes.RemoveNamedItem("href");

        // Move all of the referee's children into the referrer.
        while (referee.FirstChild != null)
        {
            XmlNode fc = referee.FirstChild;
            referee.RemoveChild(fc);
            child.AppendChild(fc);
        }

        // Copy the referee's attributes (except id) onto the referrer.
        foreach (XmlAttribute attr in referee.Attributes)
        {
            if (attr.LocalName == "id")
                continue;

            child.Attributes.Append((XmlAttribute)attr.Clone());
        }

        referee.ParentNode.RemoveChild(referee);
    }
}

thx to anyone who takes a look

Pii
November 8th, 2007, 21:33
Strange implementation of the tree: such copying should take O(1) time with simple pointer swapping.
As I understand it, this piece of code takes O(tree size^2) time.
Traverse the whole tree once and build a dictionary with href -> node pointer entries.
With a good dictionary implementation the time should drop to O(n log n) (O(n) for the dictionary precomputation).
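
In C# that could look roughly like this (a sketch; doc and idToNode are names used only for illustration):
Code:

// One pass over the whole tree to map each id to its node; each referee
// lookup then becomes a hash lookup instead of a fresh XPath search.
Hashtable idToNode = new Hashtable();
foreach (XmlNode node in doc.SelectNodes("//*[@id]"))
{
    idToNode[node.Attributes["id"].Value] = node;
}

// later, for a referrer with href="#foo":
// XmlNode referee = (XmlNode)idToNode["foo"];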

upb
November 9th, 2007, 10:26
Big thanks, Pii, for your suggestion.

I found what the bottleneck was:
the repeated call to find the referee for each referrer.

Stats for the new version are:
9.11.2007 17:21:36 select detached >
9.11.2007 17:21:36 <. have 44940 items
9.11.2007 17:21:37 build map >
9.11.2007 17:21:37 <. took 0,7344925sec
9.11.2007 17:21:37 embed >
9.11.2007 17:22:11 <. took 33,724145sec

The improvement comes from pulling the referrers and referees with a single XPath operation.
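
Roughly, a single select returns both kinds of elements, after which the id map and the referrer/referee pairs are built without any further XPath searches (a sketch; the exact XPath expression may differ):
Code:

// One XPath call pulls every referrer (@href) and every referee (@id) at once.
XmlNodeList affectedNodes = container.SelectNodes("descendant::*[@href] | child::*[@id]");

// build map: id -> node
Hashtable IdToNode = new Hashtable();
foreach (XmlNode node in affectedNodes)
{
    if (node.Attributes["id"] != null)
        IdToNode[node.Attributes["id"].Value] = node;
}

// pair each referrer with its referee
Hashtable references = new Hashtable();
foreach (XmlNode node in affectedNodes)
{
    if (node.Attributes["href"] != null)
    {
        string href = node.Attributes["href"].Value.Substring(1);
        references.Add(node, IdToNode[href]);
    }
}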


I'll test with different amounts of data to discover the actual time complexity;
it seems nothing can be assumed about MS's DOM implementation (I would have thought selecting a direct descendant by an attribute would be cheap),
or about my inability to think through the most basic properties of an algorithm before implementing it :P

upb
November 9th, 2007, 16:28
I was still wondering about the 30 seconds the embed operation took, so...

First we refactor a bit by extracting the embed operation for a referrer:
Code:

foreach (DictionaryEntry reference in references)
{
    XmlNode referrer = (XmlNode) reference.Key;
    XmlNode referee = (XmlNode) reference.Value;

    EmbedNode(referee, referrer);
}
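
where EmbedNode is presumably just the inner part of the original loop, something like:
Code:

// Moves the referee's children and attributes (except id) into the referrer,
// then removes the referee. Roughly the body of the original loop.
static void EmbedNode(XmlNode referee, XmlNode referrer)
{
    referrer.Attributes.RemoveNamedItem("href");

    while (referee.FirstChild != null)
    {
        XmlNode fc = referee.FirstChild;
        referee.RemoveChild(fc);
        referrer.AppendChild(fc);
    }

    foreach (XmlAttribute attr in referee.Attributes)
    {
        if (attr.LocalName == "id")
            continue;

        referrer.Attributes.Append((XmlAttribute)attr.Clone());
    }

    referee.ParentNode.RemoveChild(referee);
}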


This of course doesn't change the performance.

Move the referee finding to the Embed part:
Code:

foreach (XmlNode child in affectedNodes)
{
    if (child.Attributes["href"] != null)
    {
        references.Add(child, null);
    }
}

Code:

foreach (DictionaryEntry reference in references)
{
    XmlNode referrer = (XmlNode) reference.Key;
    string href = referrer.Attributes["href"].Value.Substring(1);
    XmlNode referee = (XmlNode)IdToNode[href];

    EmbedNode(referee, referrer);
}


This doesn't change the performance either (and why should it?),
but why use a dictionary at all when we only store the nodes and iterate over them?
Change it to an ArrayList:
Code:

ArrayList references = new ArrayList();

and
Code:

..
references.Add(child);
..

Fix the iteration as well:
Code:

foreach (XmlNode referrer in references)


This change doesn't affect performance either, right?
Wrong:
Code:

9.11.2007 23:09:12 embed >
9.11.2007 23:09:13 <. took 0,6407275sec


I suspect it may have to do with modifying objects that are used as Hashtable keys.
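
For what it's worth, with .NET 2.0 generics the same structure can be written without the object casts (a sketch, not from the original code; requires System.Collections.Generic):
Code:

// Generic equivalents of the ArrayList/Hashtable pair used above.
List<XmlNode> references = new List<XmlNode>();
Dictionary<string, XmlNode> IdToNode = new Dictionary<string, XmlNode>();

foreach (XmlNode referrer in references)
{
    string href = referrer.Attributes["href"].Value.Substring(1);
    EmbedNode(IdToNode[href], referrer);   // no casts needed
}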

Pii
November 9th, 2007, 17:42
Mate, there is no magic in calculating complexity.
If I understand correctly, what you did before is:
for each node, traverse the whole tree to find the node matching that node's href value.
If there are n nodes, then your execution time is n*n*c, where c is the time needed for one "visit node" operation.
You have to visit each node anyway, so the only place left to optimize is the searching. Instead of searching the whole tree each time, you can remember (href, [node1,..,nodeN]) pairs in some "fast" data structure like a dictionary. Then you can simply ask the dictionary for the node set.

Of course, maybe I'm missing something here; I admit I haven't read your code.

upb
November 9th, 2007, 20:00
Yes, I agree, but there is a limitation I didn't mention before (which makes it not O(n*n); still, your comment got me thinking):

All the elements with @id are children of a known element (container), so there's no need to search the whole tree, just the direct children of one node.

That's why I thought the operation of finding the referee would be cheap.

The number of @href elements equals the number of @id elements (m), so
it's something like O(m *   // times EmbedRefs is called
    (m +                    // select referee
     n/m +                  // select detached children
     c                      // move children
    )) = O(m*m + n)

But thx again

LLXX
November 10th, 2007, 03:50
I'd first parse the input XML to build a hash table mapping element names to their offsets within the file, then copy each element in the hash table to the output XML, replacing the hrefs with their content as they're copied and marking the element in the table so it doesn't get copied again. This is O(N) time complexity, where N is the total number of elements in the input.
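
A sketch of what the first pass might look like with a streaming reader (here elements are indexed by their id attribute and the raw markup is stored instead of a file offset; names are illustrative):
Code:

// First pass of the streaming idea: index every element that carries an id.
// ReadOuterXml() captures the element and advances the reader past it.
// (Assumes id-bearing elements are not nested inside one another.)
static Hashtable IndexByIdStreaming(string path)
{
    Hashtable idToMarkup = new Hashtable();
    XmlReader reader = XmlReader.Create(path);
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.GetAttribute("id") != null)
        {
            string id = reader.GetAttribute("id");
            idToMarkup[id] = reader.ReadOuterXml();
        }
        else
        {
            reader.Read();
        }
    }
    reader.Close();
    return idToMarkup;
}

// The second pass would then copy the document to the output, writing the stored
// markup in place of each href reference and skipping the elements already indexed.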