weixin_39848953
weixin_39848953
2020-11-30 19:28

Random TNS-12535 (Probably connection timeout)

  1. What versions are you using? Database => 12.1.0.2.0

process.platform => Linux process.version => v12.16.1 process.arch => x64 require('oracledb').versionString => 4.2.0 require('oracledb').oracleClientVersionString => 19.6.0.0.0

  1. Is it an error or a hang or a crash? More like error

  2. What error(s) or behavior you are seeing? Randomly connection to database fails with below error. My application is a monitoring system that connect to many hosts/databases and check their status by checking if its reachable or not. (No need to pass authentication). I've noticed that always there is problem with three databases only. Rest of the databases seems to be fine. This is annoying as this is monitoring application and cause false alerts. Just to clarify I dont want to use pools etc, every connection should be established from scratch to check if the database is up or not. This was only one idea that come on my mind to check whether database is up or not and if its responding for new connections, maybe you have better ideas? :)


Fatal NI connect error 12170, connecting to:
 (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=SID)(CID=(PROGRAM=app.js)(HOST=TARGET_HOSTNAME)(USER=USER)))(ADDRESS=(PROTOCOL=tcp)(HOST=IP_ADDRESS)(PORT=PORT)))

  VERSION INFORMATION:
        TNS for Linux: Version 19.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
  Version 19.6.0.0.0
  Time: 05-OCT-2020 02:15:03
  Tracing to file: /XXX/trace_log.xml_24195.trc
  Tns error struct:
    ns main err code: 12535
    TNS-12535: Message 12535 not found; No message file for product=network, facility=TNS
    ns secondary err code: 12579
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0


module.exports = async (DB_SID, DB_HOSTNAME, DB_PORT, DB_CONNECTION_ID) => {
    let connection
    let TIMEOUT_DELAY = __config.oracle_db_connection_timeout + __config.mon_promise_timeout


    let dbConfig = {
        user: __config.oracle_defaultUsername,
        password: __config.oracle_defaultPassword,
        connection = `${DB_HOSTNAME}:${DB_PORT}/${DB_SID}`
    };


    return new TimeoutRejectPromise(TIMEOUT_DELAY, (resolve, reject) => {
        oracle.getConnection(dbConfig, function (error, conn) {
            if (error) {
                switch (error.errorNum) {
                    case 1017:
                        if (__config.oracle_noAuthAccept) {
                            // Return OK if connection was established but authentication not passed
                            return resolve()
                        }

                        return reject()
                    case 28000:
                        if (__config.oracle_noAuthAccept) {
                            // Return OK if connection was established but user is locked
                            return resolve()
                        }

                        return reject()
                    default:
                        log.error(logConn + 'Unexpected error ', error);
                        return reject()
                }
            }

            conn.close(function (error) {
                if (error) {
                    log.error(logConn + 'Error closing connection ', error)
                    return reject()
                } else {
                    log.info(logConn + 'Connection closed.');
                    return resolve()
                }
            })
        })
    }).then(function (res) {
        return true;
    }).catch(function () {
        return false;
    })
}

Additionally this is my current sqlnet.ora -


TRACE_LEVEL_CLIENT=16
TRACE_DIRECTORY_CLIENT=/XXX/traces
TRACE_FILE_CLIENT=trace_log.xml
TRACE_UNIQUE_CLIENT=ON
TRACE_TIMESTAMP_CLIENT=TRUE
DIAG_ADR_ENABLED=ON
ADR_BASE=OFF
SQLNET.INBOUND_CONNECT_TIMEOUT=3
SQLNET.OUTBOUND_CONNECT_TIMEOUT=3
SQLNET.RECV_TIMEOUT=3
SQLNET.SEND_TIMEOUT=3
SQLNET.EXPIRE_TIME=1
TCP.CONNECT_TIMEOUT=3
LOG_DIRECTORY_CLIENT=/XXX/traces
LOG_FILE_CLIENT=sqlnet.log

Databases are on different hosts than the application. I strongly believe the problem might be in my sqlnet ora adjustments. Not sure whether they are fine or not. I was trying to tunning them to make sure that no connection longer than 3 seconds happens as it would delay whole monitoring, also if database do not respond on 3 seconds then actually it's not really working properly database 👍

Not sure whether there can be any Linux(OS) parameters that can improve the connection to databases, or maybe sqlnet parameters that can improve that.

Thank you so much in advance.

P.s. I'm very sorry for marking it as bug accidentally, I doubt it's a bug. Unfortunately I can't change it now.

该提问来源于开源项目:oracle/node-oracledb

  • 点赞
  • 写回答
  • 关注问题
  • 收藏
  • 复制链接分享
  • 邀请回答

21条回答

  • weixin_39620037 weixin_39620037 5月前

    The secondary error code 12579 may be indicating the host is unreachable, you can ping or tnsping to verify the target host is reachable. Since tracing is on as configured in the sqlnet.ora file, you can look at the trace file to identify the actual cause. Or you upload the trace file for further examination.

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    Hi ,

    first of all, thank you for response. I'm getting out of ideas what it can be. So from the database perspective, there is almost no chances that it would be unreachable, there is application (another) that constantly using this database to process data. (it's a huge application) and if the host would be unreachable for sure I would see it in application logs, but I've checked it and there is no such errors. Database was for 99% all time available, oracle DB Server logs also shows no errors.

    Additionally, that host which I am connecting to contains more than one database installed, there are like 15 databases, and ONLY 3 of them are problematic in most cases. And all of them are being checked in the same time (async check). So there is like no chances that host was unreachable.

    About traces, unfortunately those parameters that I have did not generate trace file somehow, as per sqlnet.log traces should be created but they do not exists. Ill check and tunning params to get the traces. I'll provide them soon.

    P.s. Also there is no chance that the host was unreachable, application before doing oracle connection doing also telnet test, and the telnet test passed (few ms before oracle connection)

    点赞 评论 复制链接分享
  • weixin_39859220 weixin_39859220 5月前
    • in your sqlnet.ora set DIAG_ADR_ENABLED=OFF so that trace files will be written to the directory you specify. We need to see both the client AND server trace to know what is wrong. Here is how I recommend setting up tracing, including how to find the server trace file that corresponds to the client trace:

    n your sqlnet.ora file on the client, add these parameters (adjust the path to make sense for your environment):

    DIAG_ADR_ENABLED = off TRACE_LEVEL_CLIENT = 15 TRACE_UNIQUE_CLIENT = on TRACE_DIRECTORY_CLIENT = \network\log TRACE_FILE_CLIENT = client

    in your sqlnet.ora file on the server, add these parameters (adjust the path to make sense for your environment):

    DIAG_ADR_ENABLED=off TRACE_DIRECTORY_SERVER=/tmp/oracle TRACE_FILE_SERVER=oracle

    TRACE_LEVEL_SERVER=15

    leave TRACE_FILE_SERVER commented until you are ready to do your test to generate the trace files (to keep from getting too many trace files). Once you are ready to begin your test, uncomment TRACE_FILE_SERVER, run your test, and immediately comment it again to stop further generation of server trace files.

    Examine the client trace file - look for a string like this: (1) [29-APR-2020 10:55:44:848] nas_ccn: Connection ID: 002b88f9fc7514

    now grep for the Connection ID in your server trace directory and find the corresponding server trace file. In the example above, you'd look for 002b88f9fc7514 so you'd use:

    cd /tmp/oracle grep 002b88f9fc7514 -d skip ./*

    Note: In the client trace file you may see multiple connection IDs. This is because someone tried, failed, and tried again or just hit return. We usually see three since the SQL*Plus default is to quit after three failures. I only need the server trace matching the FIRST connection attempt.

    点赞 评论 复制链接分享
  • weixin_39859220 weixin_39859220 5月前
    • in my previous response I asked for client and server traces - lets just start with the client trace. If you are getting an ORA-12579 there is a very good chance the client is never reaching the database server's listener, in which case no server trace would be generated. So if you'll just post a level 15 or level 16 client trace that should let me start trying to see what's going on.
    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    I really appreciate your response. I've set traces on client side, now it seems to be working and traces are being gathered.

    So far the problem happened today just twice at night, but it was before I enabled traces. Ill come back immediately once I'll have traces.

    Do you know if its possible to set traces for specific connection only? I am just curious. I notice that the traces are being gathered even if there are no issues with the connection.

    I'll try to split later on logs to provide only from failed execution.

    点赞 评论 复制链接分享
  • weixin_39859220 weixin_39859220 5月前
    • it isn't possible to set traces for a specific connection, the trace file settings apply to all connections made by the client (or to the server if you're using server tracing) that are governed by that sqlnet.ora file.

    One way you could come close to connection-specific tracing would be to set the environment variable TNS_ADMIN prior to making the connection(s) you are interested in. TNS_ADMIN lets you specify a different set of network configuration files for the client, so you could have tracing turned on in a sqlnet.ora in one directory, and turned off in the default directory.

    For a little bit more information about TNS_ADMIN, please check here: https://docs.oracle.com/en/database/oracle/oracle-database/19/netag/managing-network-address-information.html#GUID-30CC3A4B-A60E-4AF0-BD4E-DD7D79DB51BF

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    Hi,

    Please see below attached client trace, Unfortunately the trace is for all connections so there is little bit of data... And unfortunately I had to scramble the data a little bit. Problem appear for ports in log file: 1525 1541 1545

    Thank you in advance for any kind of support.

    点赞 评论 复制链接分享
  • weixin_39859220 weixin_39859220 5月前

    Sorry, too many errors and I can't figure out which are pertinent. The trace you sent is also a partial file, missing the header information

    Lets try this - create a connection to JUST one of the databases that are failing and upload that trace file.

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    The problem is that the connection do not fail for all the time. Today for example it failed just once (and the check is running every 5 minutes).

    I am limited with way to leave just one connection running for whole day.

    点赞 评论 复制链接分享
  • weixin_39859220 weixin_39859220 5月前

    oh, I see - I thought you were saying that those three databases ALWAYS failed. If it's intermittent, then it's most likely either a network connectivity issue, or your listener just isn't responding quickly enough. Are you able to insert a few seconds of wait time before each connection?

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    Let me shortly describe how it works.

    Every 5 minutes application check availability of for now around 200 destinations of multiple types. TCP Port check, URL Check, Oracle database availability etc.

    Each of the check is running asynchronous, and each of the target is accessed just once per 5 minutes. Those 3 (randomly failing databases) are on the same host where are around 15 other databases which do not have similar issues.

    I can try increase timeouts for the checks, currently it's 3 seconds for all types of check, I though that will be pretty enough as it's not any rocket science job, especially when all of the destinations are under internal network (not outside).

    Can I somehow tunning the sqlnet.ora or that function which I posted in my first post? Do you maybe have some ideas/clues/suggestions?

    点赞 评论 复制链接分享
  • weixin_39859220 weixin_39859220 5月前

    I think this is what you have on your client:

    SQLNET.INBOUND_CONNECT_TIMEOUT=3 --Not needed, this is a server-side parameter SQLNET.OUTBOUND_CONNECT_TIMEOUT=3 --this is VERY short, the unit is milliseconds. It's your second most likely culprit SQLNET.RECV_TIMEOUT=3 --also very short, but the error message makes it unlikely this is the culprit. SQLNET.SEND_TIMEOUT=3 --also very short, but the error message makes it unlikely this is the culprit. SQLNET.EXPIRE_TIME=1 - probably not an issue TCP.CONNECT_TIMEOUT=3 -this is VERY short, the unit is milliseconds. It's your most likely culprit

    You might try this to see if you get better results: SQLNET.OUTBOUND_CONNECT_TIMEOUT=300 SQLNET.RECV_TIMEOUT=30 SQLNET.SEND_TIMEOUT=30 SQLNET.EXPIRE_TIME=1 TCP.CONNECT_TIMEOUT=300

    点赞 评论 复制链接分享
  • weixin_39638929 weixin_39638929 5月前

    SQLNET.EXPIRE_TIME was traditionally a server side setting. With C clients (like node-oracledb) it can be added as (EXPIRE_TIME=n) to DESCRIPTION section of a connect descriptor in 18c. In 19c can be used via Easy Connect Plus syntax like. host/service?expire_time=n. Update: sqlnet.expire_time can be used in the "client side"(aka Node.js) sqlnet.ora from Oracle Client libraries 20c onwards: see the doc.

    But I don't think this setting has any value for a connect-disconnect flow to check the DB is reachable.

    What you may potentially find useful is using the Easy Connect Plus syntax to set the options instead of fiddling with a tnsnames.ora file. See the technical paper Oracle Database 19c Easy Connect Plus Configurable Database Connection Syntax.

    点赞 评论 复制链接分享
  • weixin_39638929 weixin_39638929 5月前

    You may also want to look at the retry_count and retry_delay options. These go into tnsnames.ora descriptors, or as Easy Connect Plus options.

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    Thank you guys for suggestions, really appreciate.

    Just now I'll implement changes into sqlnet.ora as per below -

    
    SQLNET.OUTBOUND_CONNECT_TIMEOUT=100
    SQLNET.RECV_TIMEOUT=10
    SQLNET.SEND_TIMEOUT=10
    TCP.CONNECT_TIMEOUT=100
    

    I just wonder if parameter UV_THREADPOOL_SIZE can be any issue here, currently it's set to 64. At this moment I have 34 Oracle database connections at one time, but I think this limit also affect other connections types (for other DB's, or TCP connections). What do you think?

    I'll also check for the Easy Connect, however I already have something like retry_count created, if there will be any failed results from the check, application will re-attempt those checks one more time to make sure that the destination is for sure down.

    In most cases those failed DB's connection getting cleared after second attempt, but still sometimes they still fails. I think with the time when application will become bigger I'll need about somehow splitting the checks in some parts... :<

    点赞 评论 复制链接分享
  • weixin_39638929 weixin_39638929 5月前

    Side note: sqlnet.expire_time can additionally be used in the "client side"(aka Node.js) sqlnet.ora from Oracle Client 20c onwards: see the doc.

    Re UV_THREADPOOL_SIZE, how many concurrent connections to Oracle DB are ever open at the same time? That should be the minimum value, and add extra for whatever other work might be using worker threads.

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    Hi,

    Currently to Oracle DB it's around 35 at the same time. But I wonder if it affect other connections like HANA Database. I'll review the doc you provided shortly.

    点赞 评论 复制链接分享
  • weixin_39848953 weixin_39848953 5月前

    almost 1 day after increasing parameters, so far looks stable. All the checks usually finish within max ~1.5 sec, but now after increasing parameters sometimes it take 7-10 seconds due to the oracle checks. I think I can accept that, I wanted to avoid reaching more than 5 seconds for all the checks but seems I have to live with that.

    点赞 评论 复制链接分享
  • weixin_39620118 weixin_39620118 5月前

    I think this is what you have on your client:

    SQLNET.INBOUND_CONNECT_TIMEOUT=3 --Not needed, this is a server-side parameter SQLNET.OUTBOUND_CONNECT_TIMEOUT=3 --this is VERY short, the unit is milliseconds. It's your second most likely culprit SQLNET.RECV_TIMEOUT=3 --also very short, but the error message makes it unlikely this is the culprit. SQLNET.SEND_TIMEOUT=3 --also very short, but the error message makes it unlikely this is the culprit. SQLNET.EXPIRE_TIME=1 - probably not an issue TCP.CONNECT_TIMEOUT=3 -this is VERY short, the unit is milliseconds. It's your most likely culprit

    You might try this to see if you get better results: SQLNET.OUTBOUND_CONNECT_TIMEOUT=300 SQLNET.RECV_TIMEOUT=30 SQLNET.SEND_TIMEOUT=30 SQLNET.EXPIRE_TIME=1 TCP.CONNECT_TIMEOUT=300

    Just wanted to verify, you said here that SQLNET.OUTBOUND_CONNECT_TIMEOUT and TCP.CONNECT_TIMEOUT are in milliseconds, but the Oracle docs say it's in seconds at the below?

    https://docs.oracle.com/cd/B28359_01/network.111/b28317/sqlnet.htm#NETRF427 https://docs.oracle.com/database/121/NETRF/sqlnet.htm#NETRF417

    Looks like these docs show seconds as well for RECV_TIMEOUT and SEND_TIMEOUT?

    点赞 评论 复制链接分享
  • weixin_39875842 weixin_39875842 5月前

    This issue has been automatically marked as inactive because it has not been updated recently. It will be closed if no further activity occurs. Thank you for your contributions.

    点赞 评论 复制链接分享
  • weixin_39875842 weixin_39875842 5月前

    This issue has been automatically closed because it has not been updated in 28 days.

    点赞 评论 复制链接分享

相关推荐