Project

General

Profile

tickets #80858

Re: User support data analysis: Request to export specific subforums

Added by Nycticorax 11 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Forum
Target version:
-
Start date:
2020-12-08
Due date:
% Done:

0%

Estimated time:

Description


Admin mailing list -- admin@lists.opensuse.org
To unsubscribe, email admin-leave@lists.opensuse.org
List Netiquette: https://en.opensuse.org/openSUSE:Mailing_list_netiquette
List Archives: https://lists.opensuse.org/archives/list/admin@lists.opensuse.org

Please excuse me, I failed my copy-paste:

We are looking for:

  • the full contents of all posts (original posts + replies)
  • along with thread titles (unless it's already included in original posts)
  • of the 2017-1-1, 2020-12-7 period
  • in the following 4 subforums:
    • Install/Boot/Login
    • Applications
    • Hardware
    • Network/Internet
  • exported to json
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-08 14:36
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-08 15:52
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-08 17:50
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-10 13:00
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-10 13:34
extract.tar.gz (1.17 KB) extract.tar.gz Nycticorax, 2020-12-11 14:50
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-11 14:50
extract.tar.gz (1.17 KB) extract.tar.gz Nycticorax, 2020-12-11 14:52
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-11 14:52
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-17 10:31
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-18 13:44
signature.asc (488 Bytes) signature.asc Nycticorax, 2020-12-18 16:18
extract.tar.gz (1.17 KB) extract.tar.gz Nycticorax, 2020-12-19 10:41
OpenPGP_0xEC20A62E5B61C287.asc (2.66 KB) OpenPGP_0xEC20A62E5B61C287.asc Nycticorax, 2020-12-19 10:41
OpenPGP_signature.asc (495 Bytes) OpenPGP_signature.asc Nycticorax, 2020-12-19 10:41

History

#1 Updated by pjessen 11 months ago

  • Subject changed from Re: User support data anyalsis: Request to export specific subforums to Re: User support data analysis: Request to export specific subforums
  • Category set to Forum
  • Private changed from Yes to No

In principle I see no issue with that, in practice it's not so easy. We don't really have much knowledge of the vBulletin internals, so writing a query to produce what you ask ....

I can offer a complete copy of the entire database, that is easily done. Takes up about 2.5Gb.

#2 Updated by malcolmlewis 11 months ago

pjessen wrote:

In principle I see no issue with that, in practice it's not so easy. We don't really have much knowledge of the vBulletin internals, so writing a query to produce what you ask ....

I can offer a complete copy of the entire database, that is easily done. Takes up about 2.5Gb.

pjessen, as long as it excludes the private moderator and admin forums.

#3 Updated by Nycticorax 11 months ago

We can easily work with a 2.5GB dump of the entire db provided it's dumped as .json or .csv. So yeah, that would be awesome!

#4 Updated by pjessen 11 months ago

  • Status changed from New to Feedback

Nycticorax wrote:

We can easily work with a 2.5GB dump of the entire db provided it's dumped as .json or .csv. So yeah, that would be awesome!

Question - are you SUSE staff or otherwise associated with SUSE ?

#5 Updated by Nycticorax 11 months ago

Question - are you SUSE staff or otherwise associated with SUSE ?

No and no, why?

#6 Updated by pjessen 11 months ago

Nycticorax wrote:

Question - are you SUSE staff or otherwise associated with SUSE ?

No and no, why?

Because it would likely have been easy to give you the access to the entire database or database dump. By having to fish out some of the parts, it becomes very difficult. We also have GDPR to consider.

#7 Updated by Nycticorax 11 months ago

I see.

Well, if you give me a full specification of the exported json I can give you a Haskell or Python script that extracts just what we need. You run the script, inspect the result, and send it to me :)

What do you say?

#8 Updated by pjessen 11 months ago

Nycticorax wrote:

I see.

Well, if you give me a full specification of the exported json I can give you a Haskell or Python script that extracts just what we need. You run the script, inspect the result, and send it to me :)

The mysqldump output format is sql, csv or xml. Does that work?
Let's ask the forum moderators if they can work with your proposal - Malcolm, Knurpht, Jim? Wrt GDPR, I am not a lawyer, we may have to ask SUSE Legal, dunno. (we can still try out your script though).

#9 Updated by Nycticorax 11 months ago

XML is probably the best. To write a small parser I'd need the fully specified schema.

But I can wait a few more days if they are nother means. It really depends if specifying the schema or finding other means is faster.

Thanks by the way, really appreciate your help on that one.
Adrien

Thanks Per, really appreciated your open-mindedness.

#10 Updated by malcolmlewis 11 months ago

pjessen wrote:

Nycticorax wrote:

I see.

Well, if you give me a full specification of the exported json I can give you a Haskell or Python script that extracts just what we need. You run the script, inspect the result, and send it to me :)

The mysqldump output format is sql, csv or xml. Does that work?
Let's ask the forum moderators if they can work with your proposal - Malcolm, Knurpht, Jim? Wrt GDPR, I am not a lawyer, we may have to ask SUSE Legal, dunno. (we can still try out your script though).

Hi Per
I did wonder about GDPR... I'm happy for a script to run and review the output with the other admins/moderators. But yes, some sort of legal review would be appropriate just to be sure.

#11 Updated by pjessen 11 months ago

  • Status changed from Feedback to In Progress
  • Assignee set to pjessen

Nycticorax wrote:

XML is probably the best. To write a small parser I'd need the fully specified schema.

I was waiting for you to say just that :-)
For starters, maybe take a look at the mysqldump man page, there is an example of xml output. The forums database has 257 tables, most of which I'm sure is of no interest to you. I don't see anything wrong in giving you the "table_structure" (see the man page) data first, for all of them.

But I can wait a few more days if they are nother means. It really depends if specifying the schema or finding other means is faster.

Thanks by the way, really appreciate your help on that one.
Adrien
Thanks Per, really appreciated your open-mindedness.

Haha, I thought your idea was the truly clever one.

#12 Updated by hendersj 11 months ago

malcolmlewis wrote:

pjessen wrote:

Nycticorax wrote:

I see.

Well, if you give me a full specification of the exported json I can give you a Haskell or Python script that extracts just what we need. You run the script, inspect the result, and send it to me :)

The mysqldump output format is sql, csv or xml. Does that work?
Let's ask the forum moderators if they can work with your proposal - Malcolm, Knurpht, Jim? Wrt GDPR, I am not a lawyer, we may have to ask SUSE Legal, dunno. (we can still try out your script though).

I guess still it kinda depends on what data is necessary; with the number of tables in the database schema, building SQL queries is pretty involved, but if we knew what the question is that we're trying to answer, there may be some built-in capabilities that will answer those questions. I think we need to better define what questions we'd like to answer, and then go from that point to see what tools need to be used to answer those questions (maybe I missed that discussion?).

Hi Per
I did wonder about GDPR... I'm happy for a script to run and review the output with the other admins/moderators. But yes, some sort of legal review would be appropriate just to be sure.

Not a lawyer, but if there's no PII in the end report (ie, if it's just aggregate data), GDPR wouldn't apply. It sounds like we really are just looking at message counts or keywords in content.

If it's for internal use only (and not being reported publicly), then GDPR shouldn't apply either, as I understand it.

#13 Updated by pjessen 11 months ago

pjessen wrote:

In principle I see no issue with that, in practice it's not so easy. We don't really have much knowledge of the vBulletin internals, so writing a query to produce what you ask ....

I can offer a complete copy of the entire database, that is easily done. Takes up about 2.5Gb.

In case anyone wonders, on the forum machine:

mysqldump --quick --single-transaction --xml -h mysql.infra.opensuse.org -P3307 webforums

For any myisam tables it might not be a 100% consistent dump, but in this context I think it'll suffice.

#14 Updated by pjessen 11 months ago

hendersj wrote:

I guess still it kinda depends on what data is necessary; with the number of tables in the database schema, building SQL queries is pretty involved, but if we knew what the question is that we're trying to answer, there may be some built-in capabilities that will answer those questions.
I think we need to better define what questions we'd like to answer, and then go from that point to see what tools need to be used to answer those questions (maybe I missed that discussion?).

I don't think we have gone into that level of detail. Adrien has outlined what he needs, and I'm trying to accommodate him without taking on too much effort.

Not a lawyer, but if there's no PII in the end report (ie, if it's just aggregate data), GDPR wouldn't apply. It sounds like we really are just looking at message counts or keywords in content.

Adrien asked for all posts and replies in certain fora, over a certain timespan. If they were emails, they would contain personal data, but maybe they don't in vbulletin ?

If it's for internal use only (and not being reported publicly), then GDPR shouldn't apply either, as I understand it.

Yeah, that's why I asked if Adrien was SUSE staff, for instance.

#15 Updated by hendersj 11 months ago

pjessen wrote:

hendersj wrote:

I guess still it kinda depends on what data is necessary; with the number of tables in the database schema, building SQL queries is pretty involved, but if we knew what the question is that we're trying to answer, there may be some built-in capabilities that will answer those questions.
I think we need to better define what questions we'd like to answer, and then go from that point to see what tools need to be used to answer those questions (maybe I missed that discussion?).

I don't think we have gone into that level of detail. Adrien has outlined what he needs, and I'm trying to accommodate him without taking on too much effort.

Yeah, he's specified what he needs, but not the question(s) he's trying to answer - at least as far as I know. From a 'minimum of effort' perspective, if we know the questions he's trying to answer can be answered within admincp, that saves everyone time. :)

Not a lawyer, but if there's no PII in the end report (ie, if it's just aggregate data), GDPR wouldn't apply. It sounds like we really are just looking at message counts or keywords in content.

Adrien asked for all posts and replies in certain fora, over a certain timespan. If they were emails, they would contain personal data, but maybe they don't in vbulletin ?

Well, there would be PII in data linked together (I'd have to look at the table structure again - it's been a while).

If it's for internal use only (and not being reported publicly), then GDPR shouldn't apply either, as I understand it.

Yeah, that's why I asked if Adrien was SUSE staff, for instance.

It might also make a difference if data that's exported contains additional PII that's not visible publicly on the forums. If the information is already public, I don't know that it's a GDPR violation to provide it again since the data's already out there.

#16 Updated by pjessen 11 months ago

Well, I tried extracting all table_structure nodes from an xml dump, just using "xmllint --xpath", but that quickly crashed:

parser error : PCDATA invalid Char value 27
parser error : Char 0x0 out of allowed range
parser error : Premature end of data in tag field

#17 Updated by Nycticorax 11 months ago

A typo slipped through, sending again with updated script.

Okay guys I am bit lost with all these emails. Can you work with the script I'm attaching to this message?

It's "napkin" scripting but at least that's something to start with. Would be cool if we could get past blockers.

#18 Updated by Nycticorax 11 months ago

Okay guys I am bit lost with all these emails. Can you work with the script I'm attaching to this message?

It's "napkin" scripting but at least that's something to start with. Would be cool if we could get past blockers.

Thanks and see you,

Adrien

Le jeudi 10 décembre 2020, 18:15:18 CET redmine@opensuse.org a écrit :

[openSUSE Tracker]
Issue #80858 has been updated by pjessen.

Well, I tried extracting all table_structure nodes from an xml dump, just using "xmllint --xpath", but that quickly crashed:

parser error : PCDATA invalid Char value 27
parser error : Char 0x0 out of allowed range
parser error : Premature end of data in tag field


tickets #80858: Re: User support data analysis: Request to export specific subforums
https://progress.opensuse.org/issues/80858#change-357680

  • Author: Nycticorax
  • Status: In Progress
  • Priority: Normal
  • Assignee: pjessen
  • Category: Forum

* Target version:


Admin mailing list -- admin@lists.opensuse.org
To unsubscribe, email admin-leave@lists.opensuse.org
List Netiquette: https://en.opensuse.org/openSUSE:Mailing_list_netiquette
List Archives: https://lists.opensuse.org/archives/list/admin@lists.opensuse.org

Please excuse me, I failed my copy-paste:

We are looking for:

  • the full contents of all posts (original posts + replies)
  • along with thread titles (unless it's already included in original posts)
  • of the 2017-1-1, 2020-12-7 period
  • in the following 4 subforums:
    • Install/Boot/Login
    • Applications
    • Hardware
    • Network/Internet
  • exported to json

---Files--------------------------------
signature.asc (488 Bytes)
signature.asc (488 Bytes)
signature.asc (488 Bytes)
signature.asc (488 Bytes)
signature.asc (488 Bytes)

#19 Updated by Nycticorax 10 months ago

Hello,

Any news on my request? I'm sure you've got a lot on your plate already but Christmas is next corner and working on this around the celebrations is not going to be really fun for anyone. :(

#20 Updated by pjessen 10 months ago

Nycticorax wrote:

Hello,

Any news on my request? I'm sure you've got a lot on your plate already but Christmas is next corner and working on this around the celebrations is not going to be really fun for anyone. :(

Well, I thought maybe you and Jim wanted to discuss the approach first :

Yeah, he's specified what he needs, but not the question(s) he's trying to answer - at least as far as I know. From a 'minimum of effort' perspective, if we know the questions he's trying to answer can be answered within admincp, that saves everyone time.

I'll see if I can get to try out your script out, but it is a busy time of year.

#21 Updated by Nycticorax 10 months ago

Okay Per, I understand. Thanks again, and looking forward to the slightest, tiniest bit of news.

Best,

Adrien

Le vendredi 18 décembre 2020 14:33:43 CET, vous avez écrit :

[openSUSE Tracker]
Issue #80858 has been updated by pjessen.

Nycticorax wrote:

Hello,

Any news on my request? I'm sure you've got a lot on your plate already but Christmas is next corner and working on this around the celebrations is not going to be really fun for anyone. :(

Well, I thought maybe you and Jim wanted to discuss the approach first :

Yeah, he's specified what he needs, but not the question(s) he's trying to answer - at least as far as I know. From a 'minimum of effort' perspective, if we know the questions he's trying to answer can be answered within admincp, that saves everyone time.

I'll see if I can get to try out your script out, but it is a busy time of year.


tickets #80858: Re: User support data analysis: Request to export specific subforums
https://progress.opensuse.org/issues/80858#change-358926

  • Author: Nycticorax
  • Status: In Progress
  • Priority: Normal
  • Assignee: pjessen
  • Category: Forum

* Target version:


Admin mailing list -- admin@lists.opensuse.org
To unsubscribe, email admin-leave@lists.opensuse.org
List Netiquette: https://en.opensuse.org/openSUSE:Mailing_list_netiquette
List Archives: https://lists.opensuse.org/archives/list/admin@lists.opensuse.org

Please excuse me, I failed my copy-paste:

We are looking for:

  • the full contents of all posts (original posts + replies)
  • along with thread titles (unless it's already included in original posts)
  • of the 2017-1-1, 2020-12-7 period
  • in the following 4 subforums:
    • Install/Boot/Login
    • Applications
    • Hardware
    • Network/Internet
  • exported to json

---Files--------------------------------
signature.asc (488 Bytes)
signature.asc (488 Bytes)
signature.asc (488 Bytes)
signature.asc (488 Bytes)
signature.asc (488 Bytes)
extract.tar.gz (1.17 KB)
signature.asc (488 Bytes)
extract.tar.gz (1.17 KB)
signature.asc (488 Bytes)
signature.asc (488 Bytes)

#22 Updated by hendersj 10 months ago

Thanks, Adrien - I saw the list of the desired data, but not what the questions were that you wanted the data to try to answer. Can you provide some clarity/context around that? When you get the data, what's the analysis that you're looking to do, and what is the goal of that analysis?

#23 Updated by Nycticorax 10 months ago

Hello,

Attila and I have created a small data pool fed with tens of thousands of oS channels (Telegram, Discord, and Reddit for the most part). We are going to perform basic analysis on this so as to highlight patterns in:

  • where users have the most issues
  • with what hardware or software do these users have the most issues
  • how long do they have to "wait" before their issues are picked up by support people, and how long it takes for a fruitful resolution (if any)
  • how many people are not noticed quickly enough, or have their issues not tackled as well as they could have
  • is there such a thing as the average oS user, and what is their profile (in very broad strokes)
  • how are issues distributed, i.e. do we have many people with a few issues or few people with many issues?

Etc, etc.

Forums are very important to us as they may provide the "best-in-class" support activity of all platforms.

Our goal is to use this to help orientate the community so as to achieve between +5 and +10% in adoption rate by next summer.

I hope this answers your question.

Best,
Adrien

#24 Updated by pjessen 10 months ago

I've looked at the extract script, let me quickly record some notes, I will likely be distracted later.

  • I'm not familiar with python, so bear with me
  • sqlalchemy needs to be installed, but our machines have no pip
  • do I need to install pip ?

It might overall be easier to grab the database dump and install it into a local sandbox machine instead of running on our production setup. I may still have a local machine that I was testing vbulletin on before the migration.

#25 Updated by Nycticorax 10 months ago

Dear Per,

Thanks for your reply!

It might overall be easier to grab the database dump and install it into a local sandbox machine instead of running on our production setup. I may still have a local machine that I was testing vbulletin on before the migration.

Absolutely! I am not sure what steps I can make to help with that,
unless you mean that I could connect to the sandbox and do my things
there. If that is doable despite potential issues (GDPR, SUSE
credentials, etc.) I am all for it.

If that is not feasible, then the README I should have attached to the
script would have read:

  1. install pip (if the system makes a distinction between python2 pip, called pip, and python3 pip, called pip3, then we are going to need pip3.
  2. install a virtual environment manager, for example pipenv: $ pip3 install pipenv --user
  3. cd into the directory obtained from unpacking my script tar.bz, and then start the virtual environment and the install the dependency: $ pipenv shell $ pipenv install
  4. modify the variables commented in extract.py. Here I cannot do much more than to refer you to my comments. I hope they'll suffice.
  5. finally the script can be run with: $ python extract.py

I am attaching the script again, it fixes one typo.

#26 Updated by pjessen 10 months ago

  • Status changed from In Progress to Resolved

Adrien and I worked on this, I believe he has the data he needed.

Also available in: Atom PDF