Get data based on specific text

Hello everyone,

How can I extract data from multiple sheets if the table position is different sometimes, but I know that the column I need will contain a specific texts at the top and at the bottom, so I want to get the rows between them plus the data in next 3 columns.

Hi @aalnabelsi and welcome to the forum.

For best results, please post an example of both your input data, and what you expect your output to look like. Your description is a bit vague. Also, if you have an example workflow showing what you’ve already tried, you could upload that as well.

This topic may be useful for you:

Thanks @ScottF for your reply.

I will try to make it clear now with excel samples I uploaded in Excel (inputs/output). I have over 200 files and some files have multiple sheet from where I need to extract the data.

There is no standard format for those tables I am trying to consolidate . The only thing it might help is to identify some fixed titles (text) to use them as a reference to extract the data I need

Input
150. Bulk AAAAAA -DXXM050.xlsx (51.4 KB)

In file DXXM050 I added some comments explaining what I need to get the result and the same will be implemented on the other files

159. Bulk AAAAAA -DXXM030.xlsx (51.7 KB)

Output
Result.xlsx (22.2 KB)

I used the module below to get the data between the rows I need, but I still have add manually the column number where my reference is exist, but I want to automate the full process, so KNIME can find the tables I need to extract the data. Plus I need to know how to duplicate one cell in a new column.

image

Sorry for the complexity and if I am still not clear, but I can provide more details if required, and if there is no solution let me know please.

Thank you

Based on a quick scan of your problem, it seems quite complex - mostly because you can never be certain of the format of the incoming table. I realize that’s not your fault, but if possible I would encourage who ever is providing you the data to standardize their formatting.

If that can’t be done, you are likely going to need to do a bit of scripting using something like the OpenPxyl package in Python to search for and identify rows and columns of interest. @mlauber71 has several sample workflows dealing with that package on the KNIME Community Hub that may be of use to you. (Maybe you are already doing this data search in the Java Snippet above, but since you only provided a screenshot, I’m not sure.)

Sorry that I can’t provide more useful feedback for you.

@aalnabelsi I think you could do this with KNIME nodes by identifying the ‘blocks’ in the Excel Sheet and then applying the rules you described. The secret is in marking the start and possible end of the table/bock and then filling the empty lines with the missing values node as described here:

Please also note the linked threads there.

The sheets themselves might be identified by reading the sheets and selecting the ones that have the name part you described.

Since these seem to be common questions I think about writing a blog about them.

1 Like

Thanks @ScottF. I am working to create a new template with standard format to simplify the process. For now I am trying to find a way to consolidate the old files (over 200 files)

Thanks @willy_oracle for your help.

The start point will be finding the block contains “AA CONTRACT NO.” Java Snippet node I shared can find the first and last blocks but I still have to specify the column. I want it find the blocks without specifying the column. I will look at the details you shared and see if I can find a solution.

Here is what I included in Java Snippet

if (c_B.equals(“AA CONTRACT NO.”)) {
out_start = ROWINDEX;
} else {
out_start = null;
}

if (c_B.equals(“TOTAL AA UNIT COST”)) {
out_end = ROWINDEX;
} else {
out_end = null;
}

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.