
Tue Dec 13, 2011 12:01 pm

Hello,
In our code, we need exceptional performance from Spire.Doc, because it has to parse each of millions of users' documents. Most importantly, we don't know the format (document type) of a user's document. It can be anything, so we have to try to parse it, and if anything comes out, we process it for storage in an index. The idea is to capture all of the text in the user's document.
We get the document as a binary stream from SQL.

With spire.doc.dll version 3.5.1, I was using the following code to accomplish this:
...
Document document = new Document();
try
{
    document.LoadFromStream(binaryStream, FileFormat.Auto);
    result = document.GetText();
}
catch
{
    // Do nothing about documents that couldn't be parsed.
}
finally
{
    document.Close();
    binaryStream.Close();
}
..
It has some memory flaws, which make the memory commit size balloon from 200 MB to 72 GB, but we increased the Windows swap file size to compensate.

I upgraded to version 3.7.8 today. After upgrading, the code above no longer compiled. Looking at the compile errors, I noticed that the only two pieces of Spire.Doc functionality I have been using are missing:
1- error CS0117: 'Spire.Doc.FileFormat' does not contain a definition for 'Auto'
2- error CS1061: 'Spire.Doc.Document' does not contain a definition for 'GetText' and no extension method 'GetText' accepting a first argument of type 'Spire.Doc.Document' could be found

Note that these are the only features we purchased Spire.Doc for (automatic detection of the file format, and getting only the text without formatting). It has been doing its job rather well so far. How can I accomplish the same thing in 3.7.8 without increasing memory usage?
Can anyone help?

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Wed Dec 14, 2011 5:14 am

Hello felix_nukem,

Sorry for any inconvenience caused, and thank you for your patience.

We have released Spire.Doc HotFix 4.1.14. I successfully simulated what you described with the latest HotFix. Would you please try it? You can download it from the following address: http://www.e-iceblue.com/Download/download-word-for-net-now.html.


If you still have any other questions, please don't hesitate to contact us.
Have a nice day.
Tina
Technical Support/Developer,
e-iceblue Support Team

Tina.Lin
 
Posts: 152
Joined: Tue Sep 13, 2011 5:37 am

Fri Dec 16, 2011 6:12 pm

Thanks for the support, it worked. I had thought that the Standard and Pro editions were not compatible license-wise. It worked like a charm, thanks. Those commit-size jumps are not happening anymore.

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Sat Dec 17, 2011 10:28 am

As I said, it worked like a charm, but I drew my conclusions too early. After trying it in our production environment, it couldn't parse even 1% of a single job's records in the time the old version would have finished the same job three times. RAM usage is good this time, but the CPU sat at 100 percent while the results were poor. When I checked, SQL was idle, which means Spire.Doc was still busy parsing the very first 25,000 records out of millions.

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Sat Dec 17, 2011 10:34 am

To give you some data: Spire.Doc 3.5 can parse around 2 million files of unknown type, including fetching the BLOBs from SQL 2008 R2, within 3 hours on average. 4.1.14 couldn't parse the first 25,000 of the same BLOBs within about 9 hours.
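
To put those two figures side by side, here is a quick back-of-the-envelope calculation. It is only a sketch: the class and method names are illustrative, and the inputs are the rounded numbers quoted above, not measured timings.

```csharp
using System;

static class ThroughputComparison
{
    // Files parsed per second for a run of `files` documents lasting `hours` hours.
    public static double FilesPerSecond(double files, double hours)
    {
        return files / (hours * 3600.0);
    }

    static void Main()
    {
        // Figures quoted in the post: ~2 million files in ~3 hours with 3.5.
        double oldRate = FilesPerSecond(2000000, 3);

        // 4.1.14 had NOT finished 25,000 files after ~9 hours,
        // so this is an upper bound on its real throughput.
        double newRate = FilesPerSecond(25000, 9);

        Console.WriteLine("3.5:    {0:F1} files/s", oldRate);
        Console.WriteLine("4.1.14: {0:F2} files/s (at most)", newRate);
        Console.WriteLine("ratio:  {0:F0}x (at least)", oldRate / newRate);
    }
}
```

That works out to roughly 185 files per second for 3.5 against at most about 0.77 files per second for 4.1.14, so the slowdown is at least 240-fold.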

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Mon Dec 19, 2011 5:51 am

Hello felix_nukem,

Sorry for the inconvenience caused by us.

We can't reproduce the problem you described. Would you please send some test data or a test file to help us reproduce it?

Thanks in advance.
Tina
Technical Support/Developer,
e-iceblue Support Team

Tina.Lin
 
Posts: 152
Joined: Tue Sep 13, 2011 5:37 am

Mon Dec 19, 2011 5:58 am

Hello felix_nukem,

Would you please tell us at which step you ran into this problem?

Have a nice day.
Tina
Technical Support/Developer,
e-iceblue Support Team

Tina.Lin
 
Posts: 152
Joined: Tue Sep 13, 2011 5:37 am

Mon Dec 19, 2011 12:17 pm

I'm sorry that I cannot supply the data. We don't have any test data; it is all production data, which is the property of our users (user CVs). I will stick with 3.5 for now. Thanks for the support.

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Thu Mar 08, 2012 2:30 pm

With the latest hotfix (4.1.27), the following error occurs when the file format is set to Auto:
{ai: Zip exception.Can't locate end of central directory record. Possible wrong file format or archive is corrupt.
at e5.b(Stream A_0, Boolean A_1)
at jv.b(Stream A_0, Document A_1)
at Spire.Doc.Document.i(Stream A_0)
at Spire.Doc.Document.LoadFromStream(Stream stream, FileFormat fileFormat, String password)
at Spire.Doc.Document.a(Stream A_0, String A_1)
at Spire.Doc.Document.LoadFromStream(Stream stream, FileFormat fileFormat)
----
The document was saved on Windows with MS Office 2010, in docx format. Word 2010 can open and read this file without any problems.

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Fri Mar 09, 2012 3:01 am

Hi felix_nukem,

Thanks for your feedback.

We downloaded the file you attached, but found that it was damaged when we extracted it: the docx file we ended up with is a blank document of 0 KB. From your description, the docx file contains one simple line, is that right? Would you please send the docx file again (without compressing it) to help us reproduce the issue? One more thing: if a file is blank and its size is 0 KB, Spire.Doc can't load it, but if a file is blank and its size isn't 0 KB, Spire.Doc can load it.
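
One way to catch the 0 KB case before it ever reaches the library is to inspect the stream first. The following is only a minimal sketch: `StreamPrecheck` and `LooksLikeZipContainer` are hypothetical names, not part of Spire.Doc, and the check relies on the assumption that a .docx written by Word begins with the ZIP local file header signature `PK\x03\x04`. It rules out empty or obviously non-ZIP input; it does not prove the document is valid.

```csharp
using System;
using System.IO;

static class StreamPrecheck
{
    // ZIP local file header signature "PK\x03\x04"; a .docx written by Word
    // normally starts with these four bytes.
    static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Returns true when the stream is seekable, long enough, and begins with
    // the ZIP signature. The original position is restored before returning,
    // so the stream can still be handed to a parser afterwards.
    public static bool LooksLikeZipContainer(Stream stream)
    {
        if (stream == null || !stream.CanSeek || stream.Length < ZipSignature.Length)
            return false;

        long start = stream.Position;
        stream.Position = 0;
        try
        {
            byte[] header = new byte[ZipSignature.Length];
            int read = stream.Read(header, 0, header.Length);
            if (read != header.Length)
                return false;

            for (int i = 0; i < header.Length; i++)
            {
                if (header[i] != ZipSignature[i])
                    return false;
            }
            return true;
        }
        finally
        {
            stream.Position = start;
        }
    }

    static void Main()
    {
        using (var empty = new MemoryStream(new byte[0]))
        using (var zipLike = new MemoryStream(new byte[] { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 }))
        {
            Console.WriteLine(LooksLikeZipContainer(empty));   // prints "False"
            Console.WriteLine(LooksLikeZipContainer(zipLike)); // prints "True"
        }
    }
}
```

Logging the byte length whenever this check fails would also show whether the data arriving from the database is empty before any parsing is attempted.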

Thanks in advance.

Kind Regards.
Suvi
e-iceblue support

Suvi.Wu
 
Posts: 154
Joined: Thu Oct 20, 2011 2:53 am

Fri Mar 09, 2012 11:21 am

The forum didn't allow me to upload a docx file, so I compressed it with 7-Zip 9.20 (my default settings are ultra compression using LZMA2). I will rename the file and then try to upload it.

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Fri Mar 09, 2012 11:22 am

Please rename the file after getting it.

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Mon Mar 12, 2012 3:29 am

Hi felix_nukem,

Thanks for your feedback and sorry for the late reply.

We have tested the file you sent with Spire.Doc v4.1.27, and it loads fine. When we load a docx file whose size is 0 KB, the error thrown is exactly the same as yours. I have attached a project; in its root are your docx file and a blank docx file of 0 KB. When I load your file, it works, but the blank 0 KB file throws the exception you described. Please test it on your computer. If you still have a problem loading your file, please send us details of your running environment to help us resolve it.

Thanks for your cooperation.

Have a great day.
BR
Suvi
e-iceblue support

Suvi.Wu
 
Posts: 154
Joined: Thu Oct 20, 2011 2:53 am

Mon Mar 12, 2012 3:07 pm

As I said earlier, the valid docx we are trying to open is stored in SQL. The same data can be downloaded from the website without any problems, so I am sure that the incoming binaryData is valid.
So the problem is not in loading the file with LoadFromFile; it happens when I feed Spire.Doc through LoadFromStream. The code that processes the BLOB data is as follows:

private static string GetWordCV(ref byte[] binaryData, string defaultValue, int cvID, string fileType)
{
    string result = string.Empty;

    if (binaryData.Length > 0)
    {
        using (MemoryStream binaryStream = new MemoryStream(binaryData, false))
        {
            Document document = new Document();
            try
            {
                FileFormat fileFormat = GetFileFormat(fileType); // consider this as .Auto
                document.LoadFromStream(binaryStream, fileFormat); // exception occurs.
                result = document.GetText();
            }
            catch (Exception exception)
            {
                SOAServiceUtility.NLogger.WarnException(string.Format("The attached wordCV file for cvid:{0} couldn't be parsed. Component reports:\n", cvID), exception);
            }
            finally
            {
                document.Close();
                binaryStream.Close();
            }
        }
    }

    return string.IsNullOrEmpty(result) ? defaultValue : result;
}

felix_nukem
 
Posts: 15
Joined: Tue Dec 14, 2010 1:10 pm

Tue Mar 13, 2012 4:05 am

Hi felix_nukem,

Sorry for the misunderstanding on my part.

I made a demo which gets the docx content from an Access database and then feeds it to Spire.Doc with LoadFromStream. The docx file stored in the database is the one you sent me the other day. It works fine. Please have a look.

Hoping this can be helpful for you.

If you still have this problem, please feel free to contact us at any time.

Have a great day.

BR
Suvi
e-iceblue

Suvi.Wu
 
Posts: 154
Joined: Thu Oct 20, 2011 2:53 am
